Foreword



The Next Generation Information Technologies and Systems (NGITS) workshop
series is a biennial event held in Israel since 1993. Like its predecessors,
NGITS’99 brings together active members of the international research
community interested in information technology and knowledge-based systems.
Many of the base technologies in the traditional areas of database management
systems, information retrieval, and resource optimization are being deployed
nowadays in novel systems and applications that flourish with the astonishing
increase in computational power, storage capacity, communication, and, of
course, the advent of the World Wide Web. These new fronts, in turn, present
an ever-growing set of challenges to the technologies, such as data
availability, information integrity, and knowledge extraction, fuelling an
exciting set of activities.

Our workshop clearly reflects this trend, offering a rich sample of the state 
of the art at the close of the millennium and a glimpse of what is to come in 
the next one. In response to the call for papers, we received 34 high quality 
submissions, 22 of which were carefully selected by the Program Committee for 
presentation at the workshop and inclusion in these proceedings. These include 
17 full length papers as well as 5 short papers (that will be accompanied by 
demonstrations during the workshop). In addition, it is our pleasure to feature 
two invited talks, given by Professor J. Ullman of Stanford University and IBM 
Fellow C. Mohan. 

We classified the selected papers into a number of broad topics, which were 
also used to organize the workshop sessions: 

— Exploring the World Wide Web 

— Database Technology 

— Storage, Meta Information, Ontologies, and Software Engineering 

— Agent and Workflow Management Technology 

— Data Warehousing and Mining 

We would like to extend our thanks to the many individuals who spared no 
time or effort to contribute to the success of this event: 

The authors, the Program Committee members, and the technical reviewers;
and Tova Berger, Antje Endemann, Dagan Gilat, Danna Pascal, and
Nilly Schnapp, for their support in organization and logistics.

Finally, we would like to recognize the NGITS’99 institutional sponsors (listed 
in alphabetic order) and would like to thank them for their generous support: 

The IBM Research Lab in Haifa, 

The Tandem Labs Israel, a Compaq Company, 

The Technion, Israel Institute of Technology, Haifa, Israel. 

Haifa, Israel, May 1999

Ron Pinter, Shalom Tsur
Program Co-chairs

Opher Etzion
General Chair
Some Advances in Data-Mining Techniques

Jeffrey D. Ullman

Department of Computer Science,
Stanford University
Stanford, CA 94305, USA
ullman@cs.stanford.edu



Research in the MIDAS project at Stanford explores new ideas in data-mining. One early result was a new
algorithm for Web search, which led to a recently commercialized search engine called Google.

A second area of interest is generalizing techniques such as “a-priori,” which were developed
by Rakesh Agrawal and his associates at IBM Research in Almaden to allow “market-basket analysis,”
or “association-rule mining.” The latter problem deals with finding items that customers frequently buy
together. We have developed a framework called “query flocks.” In this system, we can phrase highly complex
data-mining queries, including many that are not handled well by commercial SQL systems. We then compile
the “query flock” into a sequence of SQL queries that are simple enough to be optimized by commercial
systems.

A third interesting challenge is summarizing the knowledge of the Web in a form that resembles
conventional relational data. We describe some experiments that have been carried out to exploit the
redundancy of the Web and discover the patterns in which facts of a certain kind tend to exist.

Finally, we shall talk about extending the techniques for association-rule mining to extract relationships
that are not based on “high support,” i.e., sets of items that appear very frequently in market baskets.
Important examples include intelligence gathering, where we want to find terms that are highly correlated
in documents but that do not appear in very many documents. The MIDAS group has recently developed
some techniques to process very large amounts of data and efficiently detect items that are highly correlated
but not very frequent. We can even find implications, similar to causal relationships, without requiring high
support for the associated items.



R. Pinter and S. Tsur (Eds.): NGITS’99, LNCS 1649, p. 1, 1999.
© Springer-Verlag Berlin Heidelberg 1999 
Ziv Bar-Yossef¹, Yaron Kanza², Yakov Kogan², Werner Nutt³, and Yehoshua Sagiv²

¹ Department of Electrical Engineering and Computer Science,
Abstract. QUEST is a system for Querying Semantically Tagged documents
on the World-Wide Web. The advent of new markup languages, such as XML,
facilitates authoring of Web documents that contain not just HTML tags for
instructing a browser how to view a document, but also objects that
represent the semantic structure of the document. When such documents
become widely available, more powerful methods to access and query
information on the Web will be possible. The QUEST system was designed and
implemented for querying and manipulating documents written in the markup
language OHTML. OHTML combines HTML and objects of the OEM data model.
QUEST has several new features. First, QUEST can be used to query a
combination of hypertext and object structures. Second, the results of
queries are OHTML pages and thus of the same type as the data being queried.
Third, QUEST implements a new approach for querying semistructured data
that produces meaningful answers even when the input data is incomplete,
i.e., when some variables of the query cannot be bound to database values.
Finally, the experience of developing and using QUEST for querying semantic
documents on the Web can be useful for the design and implementation of
query languages for XML. This paper provides an overview of the QUEST
system and its components.



1 Introduction 

The enormous growth in the usage of the World-Wide Web as an information
source suggests that the Web will evolve into a platform with more database
tools. One major obstacle in the evolution of the Web into one giant database
is the lack of semantics in HTML pages, which makes it difficult to distinguish
between different pieces of information in Web pages. To overcome this problem,
we have used OHTML [KMSS98], which enriches HTML with semantic tags that
define an object structure. In OHTML, the semantic tags are hidden as HTML
comments and, hence, their existence is transparent to HTML browsers. The
object structure imposed by OHTML on the data of the Web is in the style of
the Object Exchange Model (OEM) that was proposed for semistructured data (see
[Abi97,Bun97,PGMW95]). The development of OHTML started before the advent
of XML. However, we believe that the techniques developed in QUEST for OHTML
are also applicable to documents formulated in XML.

* This research was supported by Grants 8528-1-95 and 9481-1-98 of the Israeli
Ministry of Science.

This paper describes QUEST, a system for QUErying Semantically Tagged 
documents on the Web. The QUEST system was developed and implemented 
at the Hebrew University of Jerusalem. QUEST treats a set of OHTML pages as 
a semistructured database. The usage of semantic tags allows one to pose more 
precise queries than is possible over untagged HTML pages. We consider tagged 
pieces of information in Web pages as atomic values. QUEST queries refer both 
to the structure of objects and to their atomic values. 

The novel aspects of QUEST include: (1) a graphical query language; (2) the 
possibility to specify queries that retrieve incomplete information, thus taking 
into account the incompleteness of semistructured sources; and (3) answers of 
queries may become extensions of the initial database. 

Motivated by the growing use of the World-Wide Web as a heterogeneous
information source, many systems for querying the Web were developed. Among
them are W3QL [KS97,KS95], WebLog [LSS96], WebSQL [MM97,MMM97], and
WebOQL [AM98]. As a further development, systems were designed for Web site
management, e.g., Strudel [FFK+98] and Araneus [AMM97,MAM+98]. In
comparison to these systems, QUEST has the advantages of (1) using semantic
tags for more accurate querying, and (2) dealing robustly with incomplete
information. The second novelty is also an advantage when comparing QUEST to
query languages, such as Lorel [AQM+97,MAG+97,QWG+96] and UnQL [BDHS96],
that were developed for querying semistructured data in general, and not just
in the context of the Web.

Today, as XML [Con98] is becoming a standard, querying semantically tagged 
documents is an important research issue. Languages such as XML-QL [DFF+98] 
and XQL of Microsoft have been proposed recently. These languages are at an 
early stage of implementation, and we believe that our experience can be useful 
for the design and implementation of query languages for XML. 

2 Data Model 

QUEST is a system for querying hypertext documents that also embed some 
object structures. Due to the diversity of the Web, we cannot expect that doc- 
uments with object structures will conform to a fixed schema, as in a classical 
database. Object structures that show some regularity, but do not follow a strict 
explicit schema, are captured by semistructured data models [Abi97].

Fig. 1. A Portfolios Guide as an OEM database



We chose the Object Exchange Model (OEM) of [PGMW95] as the data model
for the semantic layer in our system. OEM is a semistructured data model that
represents databases as labeled directed graphs. Each node of a graph is an object
with a unique object identifier (oid). Some objects have names and are called
named objects. The named objects are the entry points to the database, and
the names serve as aliases to those objects. Each object in the database must
be reachable from some named object through a path in the database graph.
An object that is not reachable cannot be accessed and is therefore ignored. An
atomic object is an object that has no outgoing edges. It contains a value of
an atomic type, such as integer, real, string, gif, html, audio, java, etc. Objects
that have outgoing edges are complex objects. Figure 1 shows, as an example, a
database graph of a portfolio guide.

Since databases and queries are graphs of a similar structure, we introduce
a common abstraction, called skeletons. A skeleton is a directed graph with a
partial function v that assigns names to some of the nodes in the graph, such
that distinct nodes have distinct names and each node is reachable from a named
node. A database is a skeleton with two functions, one that maps edges to labels,
and one that maps atomic nodes to values.

Using skeletons as a common abstraction of the basic components provides a
high degree of uniformity, at both the conceptual level and the implementation
level. In the implementation of QUEST, this is reflected in the class
hierarchy, where databases, query graphs and result graphs are all extensions
of the “skeleton” class.
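The definitions above can be made concrete with a small sketch. The following
Python code is an illustration only, not the QUEST implementation: the class
names Skeleton and Database, their methods, and the toy data (including the
label "analyst") are assumptions introduced here.

```python
from collections import defaultdict


class Skeleton:
    """Directed graph in which some nodes carry distinct names (hypothetical sketch)."""

    def __init__(self):
        self.successors = defaultdict(list)  # node -> list of (edge id, target node)
        self.names = {}                      # name -> node (the partial naming function)
        self.nodes = set()

    def add_node(self, node, name=None):
        self.nodes.add(node)
        if name is not None:
            if name in self.names and self.names[name] != node:
                raise ValueError(f"name {name!r} already used by another node")
            self.names[name] = node

    def add_edge(self, edge, source, target):
        self.add_node(source)
        self.add_node(target)
        self.successors[source].append((edge, target))

    def reachable(self):
        """Return the set of nodes reachable from some named node."""
        seen = set(self.names.values())
        stack = list(seen)
        while stack:
            node = stack.pop()
            for _edge, target in self.successors[node]:
                if target not in seen:
                    seen.add(target)
                    stack.append(target)
        return seen

    def is_valid(self):
        """A skeleton requires every node to be reachable from a named node."""
        return self.reachable() == self.nodes


class Database(Skeleton):
    """A skeleton plus labels on edges and values on atomic nodes."""

    def __init__(self):
        super().__init__()
        self.labels = {}  # edge id -> label
        self.values = {}  # atomic node -> value

    def add_labeled_edge(self, edge, source, target, label):
        self.add_edge(edge, source, target)
        self.labels[edge] = label

    def set_value(self, node, value):
        if self.successors.get(node):
            raise ValueError("only atomic nodes (no outgoing edges) hold values")
        self.nodes.add(node)
        self.values[node] = value


# Toy data, loosely inspired by the portfolio example (not taken from Figure 1).
db = Database()
db.add_node("&1", name="Portfolios Guide")
db.add_labeled_edge("e1", "&1", "&2", "portfolio")
db.add_labeled_edge("e2", "&2", "&3", "analyst")
db.set_value("&3", "Eyal")
```

Deriving Database from Skeleton mirrors the class hierarchy described in the
text; query graphs and result graphs could extend the same base class in a
similar way.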

One purpose of the recently proposed markup language XML is to express
the semantics of certain parts of a document by means of markup tags. We
started our project before the advent of XML and created the tagging language
OHTML [KMSS98]. OHTML is an extension of HTML that superimposes an OEM
object structure on top of an HTML page, by adding semantic tags that are
hidden inside HTML comments. Thus, one can tag an HTML document without
affecting the display of the document by a browser. The tags are used to define
objects and references among those objects. Thus, a set of OHTML documents
contains a textual representation of an OEM database.

<HTML>
<TITLE> Index of Portfolios Guide </TITLE>
<BODY>
<!--(LABEL)Portfolios Guide(/LABEL)-->
<!--(OBJ id=&1 type=set name="Portfolios Guide")-->
<H3>Portfolios Guide:</H3>
<UL>
<LI><A HREF="eyal.html">
<!--(LABEL)-->portfolio<!--(/LABEL)--></A>
<!--(OBJREF)eyal.html#&0(/OBJREF)-->
<LI><A HREF="discount.html">
<!--(LABEL)-->portfolio<!--(/LABEL)--></A>
<!--(OBJREF)discount.html#&0(/OBJREF)-->
</UL>
<!--(/OBJ)-->
<HR>
<CENTER> This page is a simplified version of the
OHTML page with portfolio suggestions. </CENTER>
</BODY>
</HTML>

Fig. 2. An OHTML page with tags

Figure 2 shows OHTML code that defines a Portfolios Guide object with two
portfolio subobjects, similar to the Portfolios Guide database of Figure 1. Note
that the tags of a subobject are nested inside the tags of the parent object.

In order to interpret OHTML documents as OEM graphs, we add object
identifiers (oid’s) to the objects defined by OHTML tags. Thus, objects can be
referenced, and each object has a unique oid. The oid of an object is a
combination of a uniform resource locator (URL) and the offset of the object
from the beginning of the page. URLs also provide entry points to the database,
since browsers are capable of reaching a Web page through its URL.

In OHTML one can also use references to object id’s. In Figure 2, for example,
there are references to two subobjects having the oid’s eyal.html#&0 and
discount.html#&0. These two subobjects are children of the object having the oid
&1 and are connected to their parent via edges labeled with portfolio. Since the
two subobjects are not located physically immediately after the labeled edges
leading into them, oid references are used. Finally, OHTML also allows one to
declare the type of each atomic node, e.g., integer, gif, java, etc.
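To illustrate how such tags could be read back into graph edges, here is a
minimal Python sketch. It is not the QUEST parser: the full OHTML grammar is
not given in this paper, so the regular expressions below cover only the tag
forms that appear in Figure 2, and the function name parse_page and the page
URL "index.html" are assumptions made for this example.

```python
import re

# Abbreviated version of the page in Figure 2.
PAGE = '''<!--(OBJ id=&1 type=set name="Portfolios Guide")-->
<UL>
<LI><A HREF="eyal.html">
<!--(LABEL)-->portfolio<!--(/LABEL)--></A>
<!--(OBJREF)eyal.html#&0(/OBJREF)-->
<LI><A HREF="discount.html">
<!--(LABEL)-->portfolio<!--(/LABEL)--></A>
<!--(OBJREF)discount.html#&0(/OBJREF)-->
</UL>
<!--(/OBJ)-->'''

# Opening tag of an object: local id, type, and name.
OBJ_RE = re.compile(r'<!--\s*\(OBJ\s+id=(\S+)\s+type=(\S+)\s+name="([^"]*)"\)\s*-->')

# A visible edge label followed by a reference to the child object.
EDGE_RE = re.compile(
    r'<!--\s*\(LABEL\)\s*-->(.*?)<!--\s*\(/LABEL\)\s*-->'   # label text
    r'.*?<!--\s*\(OBJREF\)\s*(.*?)\s*\(/OBJREF\)\s*-->',     # referenced oid
    re.DOTALL,
)


def parse_page(url, text):
    """Return the first object on the page and its outgoing (parent, label, child) edges."""
    match = OBJ_RE.search(text)
    if match is None:
        return None, []
    local_id, obj_type, name = match.groups()
    parent_oid = f"{url}#{local_id}"  # oid = page URL combined with a local identifier
    edges = [(parent_oid, label.strip(), ref) for label, ref in EDGE_RE.findall(text)]
    return {"oid": parent_oid, "type": obj_type, "name": name}, edges


obj, edges = parse_page("index.html", PAGE)
```

The sketch handles a single, non-nested object per page; nested subobjects,
which OHTML allows, would need a proper parser rather than regular expressions.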

3 QUEST and How It Is Used

In this section we give an overview of QUEST from the user’s perspective. We 
show how a user can formulate queries and view their results. In later sections, 
we will describe QUEST’S query language and its components in more detail. 

To illustrate the usage of our system, we rely on a running example based
on the Web site of the Israeli economic magazine GLOBES [GLO], which holds
information about the Israeli economy. We concentrate on the part of the site
that deals with stock portfolios recommended by financial analysts. A portfolio
consists of a group of companies recommended for investment. There is a general
style in the design of the HTML documents that describe portfolios. However,
each one of the portfolios has its own particular schema. The schemas differ in the
attribute names and in the hierarchies of the objects they contain. For example,
some portfolios are flat lists of companies preceded by a short introduction, while
other portfolios group companies by sectors, e.g., “Electronics,” “Chemical,” etc.
The portfolio pages are a good practical example of semistructured data. They
contain incomplete data without a strict schema, and they contain concrete
information, such as prices, dates, etc., along with descriptions, images, links,
etc. It seems natural that one would like to ask queries against these pages. For
our experiments, we copied pages of the GLOBES site to our computer, tagged
these pages with OHTML tags, and queried them in QUEST.

3.1 Overview of the Querying Process 

We consider a set of OHTML documents as a database. A database has two
aspects. The visual view is the visualization of the HTML part of the documents
as shown by a browser. The semantic view is the second aspect, and it is the
graph structure of the set of objects contained in the database. Figure 3 shows
how QUEST simultaneously provides the two views of the GLOBES database.
The existence of two parallel views for each OHTML document is due to the
combination of HTML tags and OEM structures in the documents.

The display of the database graph in the visual view familiarizes the user with
the structure of the database and thus gives her the ability to design meaningful
queries. A QUEST query essentially consists of two graphs, the query graph and
the result graph. The query graph determines how the object graph of a database
is explored and how data are retrieved. The result graph describes the object
structure produced by the query. Both the query graph and the result graph are
drawn using a graphical user interface. In Section 4 we discuss query graphs and
their evaluation over a database. In Section 5 we cover the usage of a result graph
for the creation of result pages when submitting a query.

QUEST is a client-server system. The query is created in the client part, and
when submitted, it is transferred to the server for evaluation over the database.
After evaluating a query, the server creates the query result, which is an extension
of the given database. That implies that the result is a set of OHTML documents.
The location of the result in the database is then sent to the client as a URL,
and that part of the extended database is displayed. The display of the result
contains both the semantic view and the visual view. As a query system that
facilitates the querying process we just described, QUEST is a combination of
tools that allow a user to browse a database, to construct a new query or edit an
existing one, to evaluate a query over a given database, and to construct the
result.

Fig. 3. The two aspects of a database: semantic (left) and visual (right)

A query in QUEST is evaluated in three phases. The first phase is the search
phase, in which information is extracted from the database. In this phase the
query graph is matched to the database graph in search of similar patterns.
We thus call the results of the search phase matchings. The second phase is
the filtering phase, in which the extracted information is subjected to additional
constraints. We call the matchings that remain after the filtering phase
solutions. The third phase is the construction phase, in which an extension of the
existing database is constructed from the extracted information. The result of
the construction phase is called the query answer.

QUEST allows incomplete answers when querying incomplete information. 
Yet, we require an answer to have a maximal information content. The distinction 
between searching and filtering is necessary in order to apply some constraints 
only to matchings that contain maximal information (see Section 4). 

3.2 The Components of a Query

A query consists of two main parts. The first part defines the information to be
extracted from the database and the constraints for filtering that information.
This part plays a role similar to that of the FROM and WHERE clauses in an SQL
query. The evaluation of this part comprises the search phase and the filtering
phase. The second part defines how to create the result from the information that
was extracted by the first part. That role is similar to the SELECT clause in SQL
and is used for the construction phase.

The main parts are further divided into the following components: 

The Query Graph is a graph that is matched against the database graph 
during the search phase. Figure 4 shows the graph of a query that searches 
the Portfolios Guide database for high-tech companies whose market per equity 
value is greater than 10. 

The Search Constraints are a set of constraints that further specify how to
match the query against a database.

Fig. 5. The result graph and a node template

The Filter Constraints are applied during the filtering phase and filter the 
information that was found in the search phase. 

The Result Graph is a graph that defines the graph structure of the result 
database. It defines which new database objects are to be created as the result of 
a query, and how to connect these objects to each other and to existing database 
objects. The result graph defines the semantic view of the result database. An 
example of a result graph is given in Figure 5. 

The Templates are textual entities that specify how to construct the HTML part 
of the result database and how to combine the OEM structure of the result with 
HTML segments to create OHTML pages. The templates define the visual view of 
the result database. In the graphical user interface, one can decorate every node 
in the result graph with a node template, as shown in Figure 5. Alternatively, 
one can create template files (see e.g. Figure 7) that also contain a description 
of the result graph. 

The result database, like the original database, is a set of OHTML
documents. The system displays the graph of the result database as an answer to
the query. The query graph, the constraints, the result graph, and the templates
in Figures 4 and 5 together form a query. The result of posing this query to the
Portfolios Guide database is shown in Figure 6.

4 QUEST Queries and Their Evaluation 

In Section 3.2, we introduced the components of a query. We now show how a 
query graph with search constraints and filter constraints is evaluated. 

A query graph is a skeleton, where each node and each edge is associated
with a distinct variable. Each edge also has an edge label, which is a simple
string or a regular expression over strings. Figure 4 shows an example of a query
graph. Note that variables associated with edges are not shown. We sometimes
say “node variable $x$” when we mean “the node associated with variable $x$.”

4.1 Matchings 

The query graph is matched against the database graph by mapping each node
variable of the query graph to a node object in the database, and mapping each
edge variable to an edge or a path in the database. We distinguish between total
matchings, i.e., mappings in which all query variables are bound to database
nodes or paths, and partial matchings, in which some variables remain unbound
(and are assumed to be mapped to the symbol $\bot$, called null).

Due to the semistructured nature of the Web, data in the Web do not con- 
form to a rigid schema, and thus the data may be incomplete. Allowing only 
total matchings for a query is too restrictive, since this would assume that in- 
formation appears in certain concrete patterns and is complete. For this reason, 
QUEST has been designed so that it can handle incomplete information and 
return incomplete answers. 

The definition of matchings is based on viewing edge labels in a query as
constraints. The label of an edge in a query graph is a constraint, since only
certain edges or, more generally, paths of the database match that label;
furthermore, the topology of the matching portion of the database must be the
same as the topology of the query graph. More precisely, suppose that $e$ is an
edge variable in the query graph, such that $l$ is the label of $e$ ($l$ is a
string or, in general, a regular expression) and $e$ links the node variable $x$
to the node variable $y$. Let $\mu$ be an assignment to the variables of the
query, i.e., $\mu$ maps node variables to database objects and edge variables to
paths in the database graph. We say that $\mu$ satisfies the edge constraint of
$e$ if the path $\pi = \mu(e)$ that is assigned to $e$ satisfies the following.

1. $\pi$ is a path in the database from $\mu(x)$ to $\mu(y)$, i.e., the source of
$e$ is mapped to the source of $\pi$, and the target of $e$ is mapped to the
target of $\pi$;

2. the sequence of edge labels on the path $\pi$ satisfies the regular
expression $l$.

If the label $l$ of the edge $e$ is just a string, the last condition means that
$\mu(e)$ consists of a single database edge that is labeled with the same
string $l$.
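The two conditions can be checked mechanically. The Python sketch below is a
hedged illustration rather than the QUEST algorithm: the paper does not say how
a regular expression is applied to a sequence of labels, so the code assumes
that the labels of a path are joined with "." and matched as one string. The
edge data and the function name satisfies_edge_constraint are hypothetical.

```python
import re

# Database edges as edge id -> (source, label, target); toy data, not from the paper.
EDGES = {
    "e1": ("&1", "portfolio", "&2"),
    "e2": ("&2", "sector", "&3"),
    "e3": ("&3", "company", "&4"),
}


def satisfies_edge_constraint(path, source, target, label_pattern):
    """Check conditions 1 and 2 for a path given as a list of edge ids."""
    if not path:
        return False  # empty paths are not handled in this sketch
    triples = [EDGES[edge] for edge in path]
    # Condition 1: consecutive edges connect, and the path runs from source to target.
    connected = all(a[2] == b[0] for a, b in zip(triples, triples[1:]))
    if not connected or triples[0][0] != source or triples[-1][2] != target:
        return False
    # Condition 2: the label sequence, joined by '.', matches the pattern
    # (the '.'-joined encoding is an assumption of this example).
    labels = ".".join(label for _src, label, _tgt in triples)
    return re.fullmatch(label_pattern, labels) is not None
```

With a plain string as the pattern, only a single edge carrying exactly that
label passes, which corresponds to the special case described above.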




Querying Semantically Tagged Documents on the World-Wide Web 




We also view the names of the named nodes in the query graph as constraints. 
We say that the name constraint n is satisfied if the query node with the name 
n is mapped to a database object with the same name. 

Usually, query languages consider only total matchings, i.e., partial informa- 
tion is ignored when answering a query. Total matchings are defined as follows. 

Definition 1 (Total Matchings). A total matching is an assignment of ob- 
jects and edges of the database to the variables of the query, such that each name 
constraint is satisfied and each edge constraint is satisfied. 

That is, a total matching requires all variables in a query to be bound and all 
constraints to be satisfied. 

In partial assignments, node and edge variables may remain unbound, i.e.,
variables are mapped either to database entities or to $\bot$. Thus, the
requirement that all edge constraints and name constraints be satisfied has to be
relaxed. Essentially, we require that name and edge constraints be satisfied only
when the corresponding variables are assigned non-null values; moreover, we also
require that the portion of the query graph that is assigned non-nulls be a
skeleton.

Formally, we say that a partial assignment is defined for a node (edge)
variable if it maps the node (edge) to a non-null object (database edge). Partial
matchings are defined as follows.

Definition 2 (Partial Matchings). A partial assignment $\mu$ is a partial
matching if it has the following properties.

1. if $\mu$ is defined for a named node of the query, then the name constraint is
satisfied;

2. if $\mu$ is defined for an edge of the query, then the edge constraint is
satisfied;

3. the edges and nodes for which $\mu$ is defined form a skeleton.

Condition 3 means that if $x$ is either a node or an edge that is assigned a
non-null value, then there is a path from a named node to $x$, such that $\mu$
assigns non-null values to all nodes and edges on that path.
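Condition 3 can be tested with a simple fixed-point computation. The following
Python sketch is only one possible reading of the definition; the function name
and the encoding of the query graph (a set of named node variables plus a
dictionary of edges) are assumptions of this example, and None stands for the
null value.

```python
def defined_part_is_skeleton(named_nodes, edges, assignment):
    """Check condition 3 of Definition 2.

    named_nodes: set of query node variables that carry a name
    edges: dict mapping edge variable -> (source node variable, target node variable)
    assignment: dict mapping every variable to a database entity, or None for null
    """
    defined = {var for var, value in assignment.items() if value is not None}
    # Start from the named nodes that are assigned non-null values.
    reached = {node for node in named_nodes if node in defined}
    changed = True
    while changed:
        changed = False
        for edge, (src, tgt) in edges.items():
            # Follow an edge only if the edge and both of its endpoints are defined.
            if edge in defined and src in reached and tgt in defined:
                if edge not in reached or tgt not in reached:
                    reached.update((edge, tgt))
                    changed = True
    # Every defined node and edge must lie on a fully defined path from a named node.
    return reached == defined
```

For a query graph x -e1-> y -e2-> z in which only x is named, binding x, e1,
and y while leaving e2 and z null passes the check; binding x, e2, and z while
leaving e1 and y null does not, because z is then cut off from the named node.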

4.2 Constraints 

In addition to the constraints implicit in the query graph, explicit constraints
can be specified. Explicit constraints are either search constraints or filtering
constraints. Furthermore, in the presence of nulls, constraints can be satisfied
either weakly or strongly. We first define weak and strong constraints.

Constraints are expressions composed of Boolean operators and atomic
expressions. Atomic expressions are either constants or variables that occur in
the query graph. A variable in a constraint can be bound either to a value of an
atomic database node or to an object identifier of a node. Thus, we have two
sets of comparison operators: $C_v = \{<, \le, >, \ge, ==, !=, =\}$ for comparing
values and $C_o = \{=o=, !o=\}$ for comparing the identities of database objects.
A simple constraint is a constraint of the form $a_1 \theta a_2$, where $a_1$ and
$a_2$ are atomic expressions and $\theta$ is a comparison operator from $C_v$ or
$C_o$.

To take into account partial matchings, we define two ways to evaluate
constraints with respect to an assignment: strong evaluation and weak evaluation.
The point is that we still want to evaluate constraints if the assignment is
undefined for some query variables.

Consider a simple constraint $a_1 \theta a_2$ and a partial assignment $\mu$ for
the variables in the query. For the sake of simplicity, we adopt the convention
that $\mu$ is defined for all constants and maps a constant to itself. The
comparison operators in $C_v$ expect atomic values as arguments. If the variable
in the argument position is bound to a complex database object, the constraint is
not satisfied. In the other cases, the constraint can be evaluated in the
following two ways:

1. Strong Evaluation: the constraint $a_1 \theta a_2$ is satisfied if $\mu$ is
defined for both $a_1$ and $a_2$ and the values to which $a_1$ and $a_2$ are
bound by $\mu$ satisfy $\theta$;

2. Weak Evaluation: the constraint $a_1 \theta a_2$ is satisfied if one of the
following is true: (1) the assignment $\mu$ is not defined for $a_1$ or $\mu$ is
not defined for $a_2$; (2) the assignment $\mu$ satisfies $a_1 \theta a_2$ under
strong evaluation.

If $\bot$ is assigned to some variable of a simple constraint, then the constraint
is never satisfied under strong evaluation and is always satisfied under weak
evaluation. If a constraint is satisfied under strong (weak) evaluation, we say that
it is strongly satisfied (weakly satisfied). Satisfaction of Boolean combinations of
simple constraints can be defined in the obvious way. Each constraint in a query
is entered either as a weak or as a strong constraint.
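The difference between the two evaluation modes fits in a few lines. The sketch
below is a simplified illustration, not QUEST code: it covers only value
comparisons with Python's standard operators, represents null as None, and does
not model complex objects or the identity operators. The names Var,
strongly_satisfied, and weakly_satisfied, as well as the variable names in the
sample assignment, are hypothetical.

```python
import operator
from dataclasses import dataclass

# A subset of the value comparison operators, mapped to Python functions.
OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
       ">=": operator.ge, "==": operator.eq, "!=": operator.ne}


@dataclass(frozen=True)
class Var:
    """A query variable; any other term is treated as a constant."""
    name: str


def value_of(term, assignment):
    """Constants map to themselves; unbound variables map to None (null)."""
    if isinstance(term, Var):
        return assignment.get(term.name)
    return term


def strongly_satisfied(a1, op, a2, assignment):
    """Both sides must be non-null and the comparison must hold."""
    v1, v2 = value_of(a1, assignment), value_of(a2, assignment)
    return v1 is not None and v2 is not None and OPS[op](v1, v2)


def weakly_satisfied(a1, op, a2, assignment):
    """A null on either side satisfies the constraint; otherwise compare."""
    v1, v2 = value_of(a1, assignment), value_of(a2, assignment)
    return v1 is None or v2 is None or OPS[op](v1, v2)


assignment = {"ratio": 12, "market_value": None}
print(strongly_satisfied(Var("market_value"), ">=", 500, assignment))  # False
print(weakly_satisfied(Var("market_value"), ">=", 500, assignment))    # True
```

Wrapping variables in Var keeps string constants from being mistaken for
variable names, which matches the convention that constants map to themselves.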

4.3 Search Constraints and Maximal Matchings 

For each explicit constraint, the user specifies whether it is weak or strong and, 
furthermore, whether it is to be used in the search phase or in the filtering phase. 

During the search phase of the query evaluation, QUEST constructs match- 
ings for the variables of the query graph. These matchings must satisfy Defini- 
tion 2 and, furthermore, each explicit search constraint must be satisfied either 
weakly or strongly, as specified by the user. 

The partial matchings constructed during the search phase may exhibit some
redundancies, since a partial matching $\mu$ may yield another partial matching
$\mu'$ by making $\mu$ defined for fewer variables. Formally, we say that $\mu$
subsumes $\mu'$ if for every variable $x$ for which $\mu'$ is defined,
$\mu(x) = \mu'(x)$. In other words, $\mu$ is the same as $\mu'$, except that
$\mu$ may be defined for some entities in the query graph for which $\mu'$ is not
defined. We say that a matching is maximal if it is not subsumed by any other
matching. To avoid redundancies as well as unnecessary computations, the search
phase should only construct maximal matchings. Note that maximal matchings
cannot be extended over the given database without violating some constraint.
Intuitively, maximal matchings contain maximal information, which is the best
we can expect when information in the database may be incomplete. Maximal
matchings can be viewed as a generalization of the notion of full disjunction
[RU96,GL94].
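Subsumption and maximality translate directly into code. The Python sketch
below filters a given list of matchings after the fact, which is simpler than
(and not equivalent in cost to) a search phase that constructs only maximal
matchings. It assumes that every matching assigns every query variable, using
None for null; the function names and the sample oid's are hypothetical.

```python
def subsumes(m1, m2):
    """m1 subsumes m2 if m1 agrees with m2 wherever m2 is defined (non-None)."""
    return all(m1.get(var) == val for var, val in m2.items() if val is not None)


def maximal(matchings):
    """Keep the matchings that are not subsumed by a different matching."""
    result = []
    for m in matchings:
        if not any(other != m and subsumes(other, m) for other in matchings):
            result.append(m)
    return result


matchings = [
    {"x": "&1", "y": "&2", "z": None},
    {"x": "&1", "y": None, "z": None},   # subsumed by both other matchings
    {"x": "&1", "y": "&5", "z": "&6"},
]
print(maximal(matchings))
```

The first and third matchings disagree on y, so neither subsumes the other and
both are retained; the second adds no information and is dropped.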

Consider the query graph in Figure 4. Table 1 shows the maximal partial
matchings produced by the query, when evaluated over the database depicted in
Figure 1. For easier comprehension, we have replaced the oid’s of atomic objects
by their values. We only show the assignments to the node variables, since (in this
example) these assignments uniquely determine the assignments to the edges.

Table 1. Maximal matchings of the query in Figure 4



The partial matchings in Table 1 are maximal, since none of the null values 
in each matching can be replaced by a database object in a way that will satisfy 
the edge constraint in the query graph. 

4.4 Filter Constraints 

During the second phase of the query evaluation, the maximal matchings from 
the search phase are filtered. Filter constraints are either strong constraints or 
weak constraints, as specified by the user. The maximal matchings that satisfy 
all the filter constraints are called solutions. 

There is a need for both strong constraints and weak constraints due to 
the presence of partial information. The basic difference between the two is 
that strong constraints are only satisfied if variables are bound, and thus certain 
information is required to be present in order to satisfy strong constraints. Weak 
constraints do not require information to be present. If information is available in
a given matching and that information violates the constraint, then the matching
is dismissed; however, if the information is not available, then the matching
gets the benefit of the doubt and is retained. For example, if a query asks for
companies that have a market value of at least 500 million dollars, then we will
not receive in the result companies for which it is known that their market value 
is below that figure, but we may receive companies for which the market value 
is unknown. 
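
The difference between the two kinds of filter constraints can be sketched in a few lines. The sketch assumes that a matching is a dictionary and that a null is `None`; the helper names `strong`, `weak` and `filter_solutions`, and the sample companies, are hypothetical and only mirror the market value example above.

```python
from typing import Callable, Dict, List, Optional

Matching = Dict[str, Optional[object]]
Constraint = Callable[[Matching], bool]


def strong(var: str, pred: Callable[[object], bool]) -> Constraint:
    """Satisfied only if var is bound and its value passes the predicate."""
    return lambda m: m.get(var) is not None and pred(m[var])


def weak(var: str, pred: Callable[[object], bool]) -> Constraint:
    """Violated only if var is bound and its value fails the predicate."""
    return lambda m: m.get(var) is None or pred(m[var])


def filter_solutions(matchings: List[Matching],
                     filters: List[Constraint]) -> List[Matching]:
    """Filtering phase: keep the matchings that satisfy every filter constraint."""
    return [m for m in matchings if all(c(m) for c in filters)]


# Market values in millions of dollars; company C has an unknown value.
companies = [
    {"name": "A", "value": 700},
    {"name": "B", "value": 300},
    {"name": "C", "value": None},
]


def at_least_500(v):
    return v >= 500


weak_result = filter_solutions(companies, [weak("value", at_least_500)])
strong_result = filter_solutions(companies, [strong("value", at_least_500)])
```

With the weak constraint, company C is retained because its value is unknown; with the strong constraint, only company A remains.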

For strong constraints, it does not matter whether they are applied during 
the search phase or during the filtering phase. The reason for that is that a strong 
constraint is satisfied only if all the variables in the constraint are assigned non- 
null values. Consequently, it is advisable to apply strong constraints as early as 
possible during the search phase in order to prune the search space. 

For weak constraints the situation is different, since a weak constraint may 
be satisfied by changing the assignments of some variables to nulls. Therefore, it 
makes a difference whether a weak constraint is applied during the search phase 
or during the filtering phase. Note that if a new weak constraint is added to
a query as a search constraint, then it may change the result by forcing some
null assignments to variables that previously were assigned non-nulls. However,
the new weak search constraint will not decrease the number of solutions to
the query. If the same weak constraint is used in the filtering phase, then it may
decrease the number of solutions to the query. Intuitively, weak search constraints
have the effect of luring maximal matchings to be as large as possible. Once the
maximal matchings are produced, the filter constraints are used to eliminate
some of those matchings. In fact, one reason for having maximal matchings as
the result of the search phase is in order to give as much elimination power as
possible to the weak filter constraints.

Fig. 6. The result database produced when evaluating our example query (the query
graph of Figure 4 and the result graph of Figure 5) over the database of Figure 3

Note that the edge constraint and the name constraint (Definition 2) are a 
form of weak search constraint. 

The set of solutions, which is obtained as the outcome of the filtering phase, 
is used for the creation of the result. We discuss this topic in the next section. 



5 Constructing Results 

The result of a query in QUEST is a set of OHTML pages that extend the database 
over which the query is posed. Since the search and filtering phases produce sets 
of partial matchings, we need a mechanism to convert those partial matchings 
into OHTML pages. When creating OHTML pages, we must take into account the 
two aspects of OHTML, namely, the semantic view and the visual view. Thus, the 
answer returned by a query must include an OEM graph, and that graph must 
be embedded in HTML pages by means of OHTML tags. 

In principle, there are two ways to combine the two aspects, depending on
which aspect is given priority. The first approach is to produce an OEM graph and
decorate its nodes and edges with HTML. The second is to create HTML pages, and
embed in those pages objects and edges of the OEM structure. Both approaches
lead to OHTML pages. In QUEST, there are mechanisms realizing each of the
two approaches. We will discuss only the first one, which gives priority to the
semantic structure when constructing the answer.

We use two formalisms in the creation of the result. The first is a result graph 
that determines the OEM structure of the answer. The second is a set of OHTML 
templates that are used to decorate the OEM structure with HTML tags. 

5.1 Result Graph 

When creating an OEM structure out of the solutions to the query, two main 
tasks have to be fulfilled. The first is to create new objects, and the second is to 
create edges among these new objects and edges from the new objects to other 
database objects. 

These two functions are accomplished by means of a result graph. A result
graph is essentially a skeleton with edges that are labeled with edge labels and
nodes that are labeled with flat terms. A flat term is either a variable or a
term of the form f[x_1, ..., x_n], where f is a Skolem function and x_1, ..., x_n are
node variables occurring in the query graph. The idea is that new objects are
generated by applying Skolem functions to existing objects.

The solutions of a query are (partial) assignments of database objects to
node variables of the query. Suppose that f[x_1, ..., x_n] is a flat term in the
result graph. If p is a solution, then [p(x_1), ..., p(x_n)] is a tuple of database
objects with oid’s, say, o_1, ..., o_n. For each solution p, we create a new object
with the oid f[o_1, ..., o_n]. Note that if two solutions p_1 and p_2 are equal on
all the x_i, then only one object is created for them.

There are different ways to handle tuples with nulls. One approach is to create
new objects only when all the variables of a term are bound to non-null values.
However, this approach is too restrictive. Instead, we treat each null value as a
unique database object. In this way, we take into account partial solutions that
may not bind all variables, and we utilize this partial information in order to
create new objects.

Summarizing, new objects are generated as follows. First, in each solution
p, replace every null with a new unique non-null value. Secondly, for each
solution p and each flat term f[x_1, ..., x_n], create a new object having the oid
f[p(x_1), ..., p(x_n)] (duplicates are removed).
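
The two steps summarized above can be sketched as follows. This is an illustration under stated assumptions, not the QUEST implementation: a solution is a dictionary that has a key for every node variable, a null is `None`, a new oid is represented as a pair of the Skolem function name and the argument tuple, and the names `fill_nulls` and `generate_objects` are hypothetical.

```python
import itertools
from typing import Dict, List, Optional, Set, Tuple

Solution = Dict[str, Optional[str]]
FlatTerm = Tuple[str, List[str]]        # (Skolem function name, variables)
Oid = Tuple[str, Tuple[str, ...]]       # f[o_1, ..., o_n] as ("f", (o_1, ..., o_n))


def fill_nulls(solutions: List[Solution]) -> List[Dict[str, str]]:
    """Step 1: replace every null by a fresh value that occurs nowhere else."""
    fresh = itertools.count(1)
    return [
        {x: (v if v is not None else f"null#{next(fresh)}") for x, v in p.items()}
        for p in solutions
    ]


def generate_objects(solutions: List[Solution], terms: List[FlatTerm]) -> Set[Oid]:
    """Step 2: apply each flat term to each solution; the set removes duplicates."""
    filled = fill_nulls(solutions)
    return {(f, tuple(p[x] for x in xs)) for p in filled for f, xs in terms}


example_solutions = [
    {"x": "o1", "y": "o3"},
    {"x": "o1", "y": "o4"},   # agrees with the first solution on x
    {"x": "o2", "y": None},   # partial solution: y is null
]
example_terms = [("f", ["x"]), ("g", ["x", "y"])]
new_oids = generate_objects(example_solutions, example_terms)
```

The first two solutions agree on x, so only one object f[o1] is created for them, while the partial solution still contributes f[o2] and a g object whose second argument is the fresh value.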

Once the result objects are generated, edges are introduced between them
according to the edges in the result graph. Suppose that there is an edge, labeled
with l, from node n_1 of the result graph to node n_2. If there is a solution p, such
that p generates objects o_1 and o_2 from the flat terms of n_1 and n_2, respectively,
then we create an edge, labeled with l, from o_1 to o_2.

Suppose that n is a leaf node (i.e., a node without any outgoing edges) of
the result graph, and let t be the flat term of n. If o is an object created from a
solution p and the flat term t, then o is an atomic object. Since atomic objects
have values, each leaf node of the result graph has an associated string s. The
string can include variables of atomic nodes of the query graph. Such variables
should be enclosed by the $ sign, i.e., $x$. Note that the string may be just a
variable. The variables in the string are instantiated according to the solution
p, and the instantiated string becomes the value of the atomic object generated
from p and the leaf node. Since an atomic object can have just a single value,
each variable appearing in the string of a leaf node of the result graph must
also appear in the flat term of that leaf node. This requirement guarantees that
an atomic object is created for each distinct value that is produced by applying
solutions to the string of the leaf node.

A special case of a flat term is a variable, e.g., x. In this case, no new objects
are created, since no Skolem function is applied to existing oid’s. Therefore,
when a variable is used as a term, it actually defines connections between result
objects and objects of the database over which the query is evaluated. Since we 
want to avoid situations in which the result graph implies that a new outgoing 
edge has to be added to an existing object, we allow a variable as a term only 
in leaf nodes of the result graph. This requirement guarantees that new edges 
are added only between two new objects, or between a new object and an object 
that already exists in the database. 

QUEST requires result graphs to be acyclic. In addition, the list of variables
in the term of each node must include all the variables of its parent. This
requirement is due to the following reason. Assume, for example, that f[x] is a parent
term and g[y] is a child term. Let p_1 = {x/o_1, y/o_e} and p_2 = {x/o_2, y/o_e} be
two solutions. Then f[o_1], f[o_2] and g[o_e] are the newly generated objects. When
OHTML pages are created they contain these objects, and each one of the objects
f[o_1] and f[o_2] must encapsulate in its OHTML representation the representation
of the object g[o_e]. Thus, we need to have the ability to break an HTML page
into pieces stored in more than one physical location. This resembles the usage
of parameter entities in XML, but is not an HTML feature.

QUEST automatically adds missing variables to flat terms, when those
variables are needed according to the requirements specified in this section.

Since only one object can exist in the uppermost level of each OHTML page, a
dummy root object is created for each OHTML page produced in the result. Such
a root object encapsulates the objects in the uppermost level of the page and
the HTML text that appears before and after those objects. In order to create
root objects, the root of the result graph is required to have a flat term that
has the Skolem function symbol root and some variables. Each instantiation of
this flat term by some solution p will create a new root object that will reside
in a new OHTML page. Thus, the root term defines the partition of the result
database into pages.

5.2 OHTML Templates 

QUEST uses OHTML templates to create the HTML that embeds the OEM
structure of the result. In the query interface, one can add to a given node a
preceding and a succeeding HTML text. Actually, the text is HTML with references
to variables of atomic nodes of the query. The variables in the text segments
are instantiated to the atomic values to which they have been bound, and the
instantiated text segments surround the result objects that are created from the
given node.

<HTML>
<!-- (LABEL) Report (/LABEL) -->
<!-- (OBJ id=report[] type=set name="Report") -->
<BODY bgcolor=white>
<!-- (LABEL) Company (/LABEL) -->
<!-- (OBJ id=company[x3] type=set) -->
<HR>
<P>
<H3><CENTER>$x4$</CENTER></H3>
<P>
Recommended by: <I>$x5$</I>
<P>
Market per equity: $x8$
<!-- (/OBJ) -->
</BODY>
<!-- (/OBJ) -->
</HTML>



Fig. 7. A textual representation of a template 



In the actual implementation, the system produces, for each node in the 
result graph, a node template that consists of the term of the given node and 
the surrounding HTML text. The node template defines the visual display of the 
objects that are created from that node. Each node template consists of param- 
eterized HTML text and OHTML tags that define the objects to be created from 
the template, as well as the edges leading to immediate subobjects and to other 
objects that are referenced by the given object. Hence, each node template is 
essentially an OHTML document with variables from the query. The node templates
are combined into a query template, which may consist of one or more
OHTML pages (with variables).

New OHTML pages are generated from a query template by evaluating the 
template over the solutions to the query. For each solution, each variable of the 
template is bound either to a complex database object (more precisely to an oid 
reference), to the value of an atomic object, or to a null. 

Figure 7 shows a query template, where report and company are two Skolem
functions, and x4, x5 and x8 are variables embedded in the text.
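
Evaluating a template over a solution amounts to replacing each $x$ reference by the value bound to x. The sketch below illustrates only this substitution step. The function name `instantiate`, the sample values and the choice to render a null as an empty string are assumptions; the text does not say how QUEST displays nulls.

```python
import re
from typing import Dict, Optional


def instantiate(template: str, solution: Dict[str, Optional[str]],
                null_text: str = "") -> str:
    """Replace every $var$ reference by the value bound in the solution."""
    def replace(match: "re.Match[str]") -> str:
        value = solution.get(match.group(1))
        # Assumption: unbound or null variables are rendered as null_text.
        return null_text if value is None else str(value)

    return re.sub(r"\$(\w+)\$", replace, template)


fragment = "<H3><CENTER>$x4$</CENTER></H3> Recommended by: <I>$x5$</I>"
page = instantiate(fragment, {"x4": "Acme", "x5": None})
```

A full implementation would also have to emit the OHTML object tags and bind variables to oid references for complex objects, which this sketch leaves out.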

6 Conclusions 

We have designed and implemented QUEST — a graphical query language for 
semantically tagged pages on the Web. QUEST was implemented in Java and 
thus has the benefits of Java applications, such as system independence, object 
oriented design, etc. QUEST has a client-server architecture, where the client is 
a Java applet. It uses a main memory approach when querying the Web. 

The most novel feature of QUEST is its ability to query incomplete
information and return incomplete maximal answers as a result. We believe that the
ability of QUEST to query incomplete information is of great importance, due
to the semistructured nature of the Web. We also believe that the mechanism of
finding maximal matchings is natural for querying partial information in general,
and not just in QUEST. For more details on the foundation of query processing
in QUEST see [KNS99].

QUEST provides a graphical query language. We believe that the graphical 
interface facilitates easy and succinct formulation of complex queries that may 
involve a number of path expressions and constraints. Similar queries are not ex- 
pressed as easily in textual query languages, such as Lorel [AQM+97]. Moreover, 
the graphical interface also facilitates construction of new Web pages that have 
both HTML tags and object structures. Hence, the principles of this graphical 
interface may also apply to querying XML documents and generating new pages 
for the result by means of style sheets. 

Currently, our main effort is to alter the system to support XML. We believe 
that the principles used in QUEST are sufficiently general and important to be 
carried over to query languages for XML documents.

Abstract. The World-Wide Web presents new challenges to database
researchers, especially in the area of query processing. Currently, query- 
ing the World-Wide Web is done by using online indices. These sites 
employ search engines, known as “robots”, that scan the network peri- 
odically and form text based indices. A severe limitation of these search 
services is that the structural information, namely the organization of 
documents into parts pointing to each other, is lost. Several tasks, rang- 
ing from data mining to Intranet management, require the analysis of 
the hypertext structural organization. 

In this paper, we propose a simple graph based query language. In this 
language, both the query and its target are graphs. We present and 
evaluate the efficiency of a general class of algorithms for answering graph 
queries. The algorithms’ definitions take into account two important facts
of the WWW: (1) efficient algorithms must minimize the communication
needed to answer a query and (2) query evaluation involves a process of
data graph exploration.



1 Introduction 

The World-Wide Web presents new challenges to database researchers,
especially in the area of query processing. Currently, querying the World-Wide Web
is done by using online indices such as Lycos, Infoseek, AltaVista¹ and others.
These sites employ search engines, known as “robots”, that scan the network
periodically and form text based indices. A severe limitation of these search
services is that the structural information, namely the organization of documents
into parts pointing to each other, is lost. Several tasks, ranging from data
mining to Intranet management, require the analysis of the hypertext structural
organization.

Many data organizations resemble directed graphs, especially hypertext. This
has resulted in the design of query languages that view their target data as
graphs [6, 11]. In the context of object oriented databases, path expressions can
also be viewed as traversing graphs [9]. Such languages are also closely related
to the semi-structured databases of [1].

* Work supported by grant 8528-95-1 of the Israeli Ministry of Science and the Arts
and by the Fund for the Promotion of Research at the Technion.

¹ www.lycos.com, www.infoseek.com and www.altavista.digital.com

In this paper, we propose a simple graph based query language. In this
language, both the query and its target are graphs. We present and evaluate the
efficiency of a general class of algorithms for answering graph queries. The
algorithms’ definitions take into account two important facts of the WWW: (1) the
dominating cost is that of communication, (2) the structure of the data graph is
usually unknown prior to query evaluation. The consequences of these facts are
that: (1) efficient algorithms must minimize the communication needed to answer
a query and (2) query evaluation involves a process of data graph exploration.

Consider the following example. In a CS Dept., the faculty WWW pages are
organized as follows. There is an index page containing a hypertext link for each
faculty member. These links lead to pages that contain general information about
the faculty members. These pages point, in turn, to the faculty members’ home
pages. Home pages point to pages of the courses taught by the faculty members.
An example of a graph query is: find pairs of faculty members’ home pages that
point to the same course page. Given the WWW data graph, the query graph is
shown in Figure 1. The solution of the query is a set of mappings from the query
graph to the data graph such that: node I is mapped to the site’s index page,
nodes a and b are mapped to the information pages of the two faculty members,
nodes a' and b' are mapped to their home pages and C is their common
course page.

Related Work There exist several query languages for hypertext systems that
address the problem of analyzing hypertext structural information. Graph-log [5]
can be used for the graphical specification of search patterns. In [3], Beeri and
Kornatzky define a logic-based language to state queries on hypertext structures.
A query language on a dynamically changing hypertext is proposed in
[15]. W3QL [11] and WebSQL [14] are two languages defined especially for the
WWW. Weblog is a language, based on Datalog, to query and restructure WWW
information [12]. However, to date, little has been done in defining what the
requirements of a general hypertext query language are, what the basic problems
involved in answering such queries are, and what optimization techniques should
be used.

We have built W3QS [11], a query system for the World-Wide Web which
uses W3QL, a graph based language. W3QS is limited to simple queries and
uses naive algorithms for query processing. Extending W3QL’s query processing
capabilities was the primary motivation for this paper. It is also conceivable
that the techniques developed here will be useful for handling queries on
semi-structured data [1, 10] and queries on XML [4] data.

In [13], the authors propose an algorithm to answer a particular class of
graph queries: the regular simple path queries. Their algorithm does not take
into account the cost of building the data graph and considers it as free. Data
graph creation is certainly not free in the context of the WWW.

Fig. 1. An Example of a Query Graph

Paper Organization Section 2 presents the main definitions and the class of
algorithms being studied. Section 3 describes several techniques for optimizing
graph queries. Section 4 presents results of experimenting with optimization
techniques. Section 5 introduces extensions such as content conditions and
embedded XML objects. Finally, Section 6 presents future work and conclusions.

2 Graph Queries 

In this section we introduce graph queries with no content conditions, i.e., graph 
queries that focus on the hypertext structure. In section 5, we show how the 
algorithms presented here can be adapted to handle conditions on the content 
of the hypertext nodes and links. 

2.1 Basic Definitions 

A graph query instance is defined by two graphs²: the query graph and the data
graph. The answer to a graph query instance is a set of subgraphs of the data
graph onto which the query graph is mapped. These notions are formalized in
the following definitions.

Since the World-Wide Web may be viewed as a directed graph (as in [11]),
it can serve as the data graph of a graph query instance where the query graph
corresponds to the searched hypertext structure.

In [2], the authors consider the WWW as infinite. Without making this
assumption, our model takes into account the size of the WWW by defining
practically computable graph query instances. In these queries, the query graph search
is restricted to a subset of the WWW by the Starting Point Function (SPF)
which maps nodes in the query graph to sets of nodes in the data graph. The
SPF models information on hypertext nodes whose possible addresses (URLs) in
the WWW are known. Therefore, the definition of practically computable query
instances using the SPF avoids the problem of an infinite search space.

Definition 1. A Starting Point Function (SPF), say $f$, from graph $G = (V, E)$
to graph $G' = (V', E')$ is a finite partial function from $V$ to sets of nodes of $V'$,
i.e., $f : V \to 2^{V'}$.

We denote the starting point function $f$ with $f(a_1) = \{b_1, \ldots, b_n\}$, ...,
$f(a_m) = \{c_1, \ldots, c_k\}$ by

$f = \{a_1 \mapsto \{b_1, \ldots, b_n\}, \ldots, a_m \mapsto \{c_1, \ldots, c_k\}\}$.

A mapping function defines how the query graph is to be mapped on the data
graph. There are two different semantics for the mapping function: Distinct and
Non-Distinct.

² By graph, we mean a finite directed graph with no self edges.

Definition 2. A Mapping Function (MP) in the Non-Distinct semantics, say
$m$, from graph $G = (V, E)$ to graph $G' = (V', E')$ is a total function from $V \cup E$
to $V' \cup E'$ that maps nodes in $G$ to nodes in $G'$ and edges in $G$ to edges in $G'$
such that $\forall e \in E$, $e = (v_i, v_j) \Rightarrow m(e) = (m(v_i), m(v_j)) \in E'$.

For the Distinct semantics, the definition is identical to Definition 2, except
that $m$ also satisfies $\forall v_i, v_j \in V$, $v_i \neq v_j \Rightarrow m(v_i) \neq m(v_j)$.

Definition 3. A Graph Query Instance is a triple $(G, G', f)$ where $G = (V, E)$
and $G' = (V', E')$ are graphs and $f$ is a SPF from $G$ to $G'$. $G$ is called the query
graph and $G'$ is called the data graph.

Definition 4. The restriction of the graph $G = (V, E)$ to the set of nodes $V'$,
where $V' \subseteq V$, denoted $G[V']$, is a graph $G_r = (V_r, E_r)$ such that $V_r = V'$ and
$\forall v, v' \in V'$, $(v, v') \in E \Leftrightarrow (v, v') \in E_r$. We say that $G[V']$ is the subgraph of $G$
that is induced by $V'$.

With some abuse of notation, we will write $G[t]$, where $t$ is a tuple containing
nodes, instead of $G[T]$, where $T$ is the set of nodes contained in $t$.

A Graph Query Solution is defined as a table in the relational model³.

Definition 5. The solution of the graph query $(G, G', f)$, where $G = (V, E)$ and
$G' = (V', E')$, denoted $S_{(G,G',f)}$, is a relation $r(V)$ such that the tuple $t \in r$ if:

— $\forall v \in V$, $t[v] \in V'$ (i.e., for each node in $V$, there is an associated node in $V'$).

— There exists a mapping function, say $m$, from $G$ to the restriction of $G'$ to
the nodes in $t$, $G'[t]$, such that $\forall v \in V$:

• $t[v] = m[v]$ (i.e., the association is determined by $m$),

• if $f$ is defined on $v$, $t[v] \in f(v)$ (i.e., the association respects the SPF
constraints).

If the mapping function respects the Distinct (resp., Non-Distinct) semantics,
we say that the solution is in the Distinct (resp., Non-Distinct) semantics.

For example, if the query graph is $G = (\{1, 2, 3\}, \{(1, 2), (1, 3)\})$,
the data graph is $G' = (\{a, b, c\}, \{(a, b), (a, c)\})$ and the SPF is
$f = \{1 \mapsto \{a\}\}$, then $S_{(G,G',f)}$ in the Non-Distinct semantics is

$R(1, 2, 3) = \{(a, b, c), (a, c, b), (a, b, b), (a, c, c)\}$.

In the Distinct semantics, nodes 2 and 3 must be mapped to different data
nodes, so only the tuples $(a, b, c)$ and $(a, c, b)$ remain.

For any node $v \in V$, $\Pi_v(S_{(G,G',f)})$ is called the solution for $v$. For example,
the solution for 2, in the previous example, is $\{b, c\}$.
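
Definition 5 can be checked on this small example with a brute-force enumeration. The sketch below is not one of the algorithms studied in this paper, since it assumes that the whole data graph is known in advance; the function name `solve` and its argument layout are hypothetical.

```python
from itertools import product
from typing import Dict, Hashable, List, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def solve(query_nodes: List[Hashable], query_edges: Set[Edge],
          data_nodes: Set[Hashable], data_edges: Set[Edge],
          spf: Dict[Hashable, Set[Hashable]],
          distinct: bool = False) -> Set[Tuple[Hashable, ...]]:
    """Enumerate all tuples t over the data nodes that satisfy Definition 5."""
    result = set()
    for t in product(sorted(data_nodes), repeat=len(query_nodes)):
        m = dict(zip(query_nodes, t))
        if distinct and len(set(t)) < len(t):
            continue  # Distinct semantics: the mapping must be injective on nodes
        if any(v in spf and m[v] not in spf[v] for v in query_nodes):
            continue  # the association must respect the SPF
        if all((m[u], m[v]) in data_edges for (u, v) in query_edges):
            result.add(t)  # every query edge is mapped to a data edge
    return result


q_nodes, q_edges = [1, 2, 3], {(1, 2), (1, 3)}
d_nodes, d_edges = {"a", "b", "c"}, {("a", "b"), ("a", "c")}
spf = {1: {"a"}}
non_distinct = solve(q_nodes, q_edges, d_nodes, d_edges, spf)
distinct = solve(q_nodes, q_edges, d_nodes, d_edges, spf, distinct=True)
```

Under these definitions the enumeration yields four tuples when images may coincide, and requiring distinct images leaves only the two tuples in which nodes 2 and 3 are mapped to different data nodes.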

Definition 6. The partial graph query solution of the graph query instance
$(G, G', f)$ where $G = (V, E)$, for the set of nodes $V_r \subseteq V$, is the solution of
the graph query instance $(G[V_r], G', f_r)$ where $f_r$ is the restriction of $f$ to the
nodes of $V_r$.

In this paper, we consider algorithms that build the query solution
progressively, by joining partial query solutions.

³ We denote by $r(R)$ that the schema for relation $r$ is $R$.

Fig. 2. A Data Graph and the Data Graph Tables

2.2 Finding the Graph Query Solution 

We now describe how the graph query solution (in the Non-Distinct semantics)
can be found⁴. We define a general class of algorithms, called progressive
algorithms. These algorithms model an important feature of WWW query processing,
i.e., the discovery of the topology of the data graph during query execution.

We use relational algebra to describe the algorithms, and therefore, we need
first to describe the graph query instance as a relational database.

Definition 7. Let $G = (V, E)$ be a graph and let $V_1$ be a subset of $V$, $V_1 \subseteq V$.
The target (resp., source) set of $V_1$, denoted $T(V_1)$ (resp., $S(V_1)$), is the set
$\{v_2 \mid \exists v_1 \in V_1, (v_1, v_2) \in E\}$ (resp., $\{v_2 \mid \exists v_1 \in V_1, (v_2, v_1) \in E\}$).
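
As a small illustration of Definition 7, the target and source sets can be computed directly from an edge set. The function names are hypothetical, and the edge set below is the faculty query graph as described in the introduction (I points to a and b, these point to a' and b', which both point to C).

```python
from typing import Hashable, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def target_set(edges: Set[Edge], nodes: Set[Hashable]) -> Set[Hashable]:
    """T(V_1): all nodes reached by an edge that leaves a node of V_1."""
    return {v2 for (v1, v2) in edges if v1 in nodes}


def source_set(edges: Set[Edge], nodes: Set[Hashable]) -> Set[Hashable]:
    """S(V_1): all nodes from which an edge enters a node of V_1."""
    return {v2 for (v2, v1) in edges if v1 in nodes}


faculty_edges = {("I", "a"), ("I", "b"), ("a", "a'"), ("b", "b'"),
                 ("a'", "C"), ("b'", "C")}
targets_of_index = target_set(faculty_edges, {"I"})
sources_of_course = source_set(faculty_edges, {"C"})
```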



Definition 8. The graph query schema for the graph query instance $(G, G', f)$,
where $G = (V, E)$ and $G' = (V', E')$, denoted $D_{(G,G',f)}$, is composed of:

— The Data Graph tables: for all $v' \in V'$, $D_{(G,G',f)}$ contains a table
$Targets_{v'}(node)$, i.e., a single column table whose column name is node.

— The Query Graph tables: for all $v \in V$ such that $v$ has outgoing edges, let
$T(\{v\}) = \{v_1, \ldots, v_n\}$; $D_{(G,G',f)}$ contains a table $Solutions_v(v, v_1, \ldots, v_n)$.

In the following sections, we describe how the graph query schema is used to
build the graph query solution. In the algorithms described below, for a given
data graph $G' = (V', E')$, the content of the data graph tables is fixed and defined
in the following way: $Targets_{v'} = \{(u') \mid u' \in V' \wedge (v', u') \in E'\}$. For example,
the data graph tables for the data graph in Figure 2(a) are presented in
Figure 2(b). The schema of the query graph tables for the query graph in Figure 1 is
$Solutions_I(I, a, b)$, $Solutions_a(a, a')$, $Solutions_b(b, b')$, $Solutions_{a'}(a', C)$ and
$Solutions_{b'}(b', C)$. The data graph tables, and the tuples contained in them,
model the WWW pages and the embedded hypertext links. The query graph
tables are used in the algorithms for holding temporary results.

⁴ Transforming the algorithms to handle the Distinct semantics is straightforward and
it is briefly described in a subsequent section.

Capturing Query Graph Nodes The central concept of progressive
algorithms is the capture of query graph nodes⁵. Let $v$ be a node of the query graph
in the query instance $I = (G, G', f)$. There are edges from $v$ to the nodes in
$T(\{v\})$. If we know that the solution for $v$, $\Pi_v(S_{(G,G',f)})$, is contained in a given
set, say $Sol$, we can use the tables $Targets_{v'}$, where $v' \in Sol$, to find a superset
of the solution set for each node in $T(\{v\})$. This superset, say $S$, is the same for
all such nodes in $T(\{v\})$: $S = \bigcup_{v' \in Sol} Targets_{v'}$. Therefore, it is possible to use
$Sol$ to build a superset of the query solution for the restriction of instance $I$ to
the nodes $\{v\} \cup T(\{v\})$. This operation is called capturing $v$.

For example, let $Q$ be a query instance whose query graph is presented in
Figure 3 and whose data graph is presented in Figure 2(a). The SPF is
$\{1 \mapsto \{a, b\}\}$. We know that $\Pi_1(S_Q) \subseteq f(1) = \{a, b\}$. Therefore, by capturing node
1, we learn that $\Pi_2(S_Q) \subseteq Targets_a \cup Targets_b = \{c, d\}$. Similarly,
$\Pi_3(S_Q) \subseteq \{c, d\}$.

The capture of a node is described in Function 2.1, capture-node⁶. For
example, let $Q$ be a graph query instance whose query graph is presented in
Figure 3 and whose data graph is presented in Figure 2(a). The SPF is
$f = \{1 \mapsto \{a, b\}\}$. The function call capture-node(1, f(1)) returns
the table $Solutions_1(1, 2, 3) = \{(a,c,c), (a,d,c), (b,c,c), (b,d,c),
(a,c,d), (a,d,d), (b,c,d), (b,d,d)\}$. Observe the (potentially)
large size of the table generated by the Cartesian product in
line 4 of the capture-node function. The function execution can
be optimized and we have presented here a simplified version for
the sake of clarity.
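
A direct transcription of the simplified capture-node function may look as follows. It is a sketch under assumptions: the data graph tables are modelled as a dictionary from a data node to the set of its targets, the arity d is derived from the number of outgoing query edges of v, and the table contents below are assumed (Figure 2 is not reproduced here) so that they are consistent with the result quoted above.

```python
from itertools import product
from typing import Dict, Hashable, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def capture_node(v: Hashable, sol: Set[Hashable], query_edges: Set[Edge],
                 targets: Dict[Hashable, Set[Hashable]]) -> Set[Tuple[Hashable, ...]]:
    """Build Solutions_v as the union of {(v')} x (Targets_v')^(d-1) over v' in Sol."""
    # d is the arity of Solutions_v: one column for v plus one per outgoing edge.
    d = 1 + sum(1 for (u, _) in query_edges if u == v)
    solutions_v = set()
    for v_prime in sol:
        # Accessing Targets_v' stands for fetching the page v' and reading its links.
        for rest in product(sorted(targets.get(v_prime, ())), repeat=d - 1):
            solutions_v.add((v_prime,) + rest)
    return solutions_v


# Assumed data graph tables, chosen to match the example in the text.
example_targets = {"a": {"c", "d"}, "b": {"c", "d"}}
example_query_edges = {(1, 2), (1, 3)}
solutions_1 = capture_node(1, {"a", "b"}, example_query_edges, example_targets)
```

The Cartesian product makes the table grow as |Targets|^(d-1) for every candidate node, which is why the text notes that the function can be optimized.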

The progressive algorithms cannot be applied to arbitrary graph query
instances. Therefore, before continuing the description of the progressive
algorithms, we define the class of graph query instances that can be solved (using
progressive algorithms).

Practically Computable Graph Query Instances An exhaustive search for
graph patterns in the WWW is infeasible. Therefore, for a query to be practically
computable, every query graph node must be reachable from some node on which
the SPF is defined.

Definition 9. Let $(G, G', f)$ be a graph query instance, $G = (V, E)$. $(G, G', f)$
is a practically computable graph query instance if there exists a set of nodes
$\{v_1, \ldots, v_n\} \subseteq V$, denoted $St_{(G,G',f)}$, such that $\forall i \in [1, n]$, $f(v_i)$ is defined, and
$\forall v \in V$, there is a directed path in $G$ from some node in $\{v_1, \ldots, v_n\}$ to $v$.
$v_1, \ldots, v_n$ are the Starting Points of $(G, G', f)$.
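
Whether an instance is practically computable can be tested by a reachability search from the nodes on which the SPF is defined. The sketch below is illustrative: the function name is hypothetical, a starting point is taken to reach itself by a path of length zero, and the URLs are placeholders.

```python
from collections import deque
from typing import Dict, Hashable, Iterable, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def practically_computable(nodes: Iterable[Hashable], edges: Set[Edge],
                           spf: Dict[Hashable, Set[Hashable]]) -> bool:
    """True if every query node is reachable from a node on which the SPF is defined."""
    all_nodes = set(nodes)
    frontier = deque(v for v in all_nodes if v in spf)
    reached = set(frontier)
    while frontier:                      # breadth-first search over query edges
        u = frontier.popleft()
        for (x, y) in edges:
            if x == u and y not in reached:
                reached.add(y)
                frontier.append(y)
    return reached == all_nodes


faculty_nodes = ["I", "a", "b", "a'", "b'", "C"]
faculty_edges = {("I", "a"), ("I", "b"), ("a", "a'"), ("b", "b'"),
                 ("a'", "C"), ("b'", "C")}
from_index = practically_computable(
    faculty_nodes, faculty_edges, {"I": {"http://example.org/index.html"}})
from_info_page = practically_computable(
    faculty_nodes, faculty_edges, {"a": {"http://example.org/info.html"}})
```

With the SPF defined only on node a, the index node I and the branch through b cannot be reached, so the second instance is not practically computable.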

⁵ This is conceptually similar to the capture of nodes in [17] in the context of rule/goal
graphs.

⁶ In the description of the algorithms, the statement for ($a \in A$) is a loop that iterates
over the elements of set $A$, assigning the value of the current element to the variable
$a$. $\setminus$ is the set difference operator.

Function 2.1 capture-node

Input:
— A query graph node, $v$.
— A set of potential solutions for $v$, $Sol$.

Output: The query graph table, $Solutions_v$.

Method:
1 begin
2   $Solutions_v = \emptyset$
3   for ($v' \in Sol$) do    ($d$ is the arity of $Solutions_v$)
4     $Solutions_v = Solutions_v \cup (\{(v')\} \times (Targets_{v'})^{d-1})$
5   od
6   return($Solutions_v$)
7 end



This definition captures the following fundamental fact. The only source of
WWW addresses (URLs) in WWW query processing is the user. All other
addresses encountered during query execution are found in pages reachable from
the pages whose addresses are explicitly defined in the query by the user (using
the SPF). We concentrate now on how to find solutions for practically
computable graph query instances.

The Progressive Algorithms The core of all the progressive algorithms is a
loop in which each node of the query graph is captured in turn. At each iteration
of the loop, the partial query solution is updated using the query graph tables.
Before we explain the algorithms in detail, the following facts are noteworthy.

1. As we show in the query optimization section, the cost of query execution
depends on the data graph tables accessed in building the query solution.
During query execution, capture-node(v, Sol) is called for each node of the
query graph. Sol is a superset of the solution for v computed using the SPF
and the query graph tables of the nodes captured thus far. Let v' be a node
in Sol. Capture-node uses $Targets_{v'}$ to compute the query graph tables. If
v' is not in the solution for v, this step is superfluous and inserts, in the
query graph tables, tuples that do not contribute to the query solution;
v' is a dangling data graph node (for v). Dangling data graph nodes are
analogous to dangling tuples when optimizing transmission cost by semijoins
in distributed RDBMS [17]. To optimize the execution of the query, we try
to reduce the number of dangling data graph nodes.

2. If a query graph node v has no outgoing edges (it is a sink of the query
graph), then it is not necessary to capture it. The solution for v may be
computed using the data graph tables of the nodes pointing to v. We do not
care where nodes that correspond to v point to.

When a progressive algorithm begins, the nodes that are candidates for
capture are the starting points of the graph query instance, i.e. the nodes in
$St(G,G',f)$. If captured is the set of the captured nodes at some iteration of a
progressive algorithm loop, the node that is captured next is either a starting
point or a node in $T(captured)$ that is not already captured, i.e. a node in the set
$(St(G,G',f) \cup T(captured)) \setminus captured$. Since there is no need to capture
nodes whose out-degree is zero, we denote by $V_e$ the set of nodes of the query
graph whose out-degree is not zero, and at an iteration of a progressive algorithm
the candidates for capture are given by
$((St(G,G',f) \cup T(captured)) \setminus captured) \cap V_e$.

Definition 10. Let $(G, G', f)$ be a graph query instance. $(v_1, \ldots, v_n)$ is a
capture ordering if:

— $v_1, \ldots, v_n$ are the vertices of the query graph G that have outgoing edges.

— $(\forall i \in [1,n])\,(v_i \in St(G,G',f) \lor v_i \in T(\{v_1, \ldots, v_{i-1}\}))$

A progressive algorithm must choose a capture ordering. In the worst case, if
the query graph is a clique and the SPF is defined on all the query graph nodes,
then there are n! possible capture orderings.

The core of the progressive algorithms is presented in Function 2.2. The
important variables of the main function are captured, the set of captured nodes,
and S, the partial query solution for the captured nodes. At line 5, a node $v \in V$
is chosen among the current candidates for capture (footnote 7). Lines 7-10 compute
the set of candidate solutions for v. If v is a starting point and captured does not
contain nodes pointing to v (Line 7), the candidate solutions for v are given by
f(v) (Line 8). Otherwise, some node pointing to v has already been captured.
Therefore, column v of S contains the candidate solutions for v (Line 9). (We will
see below why, if v is a starting point, $\pi_v(S) \subseteq f(v)$.) Next, v is captured
(Line 11). Lines 13-15 select from $Solutions_v$ the tuples that comply with f
(footnote 8), and, in line 16, $Solutions_v$ is joined to S. Therefore,
$\forall u \in St(G,G',f) \cap captured$, $\pi_u(S) \subseteq f(u)$.
Finally, v is added to captured. When all the nodes in $V_e$ have been captured,
S is the query solution (footnote 9).

We have not described the Choose function. This will be the subject of the 
next section on graph query optimization. 
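
The loop described above can be sketched in a few lines of Python. This is only an illustration reconstructed from the prose, not a transcription of Function 2.2: the table representation (lists of dictionaries), the helper `natural_join`, the parameter names, and the use of `min` as a stand-in for the Choose function are all assumptions made for the example.

```python
from itertools import product


def natural_join(left, right):
    """Join two lists of dict-tuples on the column names they share."""
    return [{**l, **r} for l in left for r in right
            if all(l[k] == r[k] for k in l.keys() & r.keys())]


def progressive(query_edges, data_targets, f, choose=min):
    """Sketch of a progressive algorithm (assumed data layout).

    query_edges : set of (u, v) edges of the query graph
    data_targets: dict, data graph node -> list of nodes it links to
    f           : dict, starting point -> set of data graph nodes (the SPF)
    choose      : placeholder for the Choose function
    """
    out = {}                                    # query node -> its targets
    for u, v in query_edges:
        out.setdefault(u, []).append(v)
    capturable = set(out)                       # V_e: non-zero out-degree
    captured, S = set(), [{}]                   # S starts as one empty tuple
    while True:
        reach = set(f) | {w for u in captured for w in out[u]}
        candidates = (reach - captured) & capturable
        if not candidates:
            return S
        v = choose(candidates)
        sources = {u for u, w in query_edges if w == v}
        if v in f and not (sources & captured):
            sol = set(f[v])                     # candidates come from the SPF
        else:
            sol = {row[v] for row in S}         # column v of S
        solutions_v = []
        for v1 in sol:                          # capture v
            links = data_targets.get(v1, [])
            for combo in product(links, repeat=len(out[v])):
                row = {v: v1, **dict(zip(out[v], combo))}
                # keep only tuples that comply with the SPF
                if all(row[w] in f[w] for w in row if w in f):
                    solutions_v.append(row)
        S = natural_join(S, solutions_v)
        captured.add(v)
```

On a toy data graph where page a links to b and c, and both b and c link to d, the path query 1 -> 2 -> 3 with f(1) = {a} yields the two expected solutions; node 3 is a sink and is never captured.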

The solution in the Distinct semantics is obtained by slightly modifying the
main function of the algorithm. Recall that in the Distinct semantics, different
query graph nodes must map to different data graph nodes. This is achieved by
selecting tuples in S that have different values in all their columns. That is, if
$\tau$ is the relational algebra select expression "every two different columns have
different values", the statement $S = \sigma_\tau(S)$ must be added after line 16 in
Function 2.2.
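
As a small illustration of this selection, assuming the partial solution S is held as a list of dictionaries that map query graph nodes to data graph nodes (a representation chosen for this sketch only), the filter can be written as:

```python
def distinct_only(S):
    """Keep the tuples of S in which no two columns hold the same value."""
    return [row for row in S if len(set(row.values())) == len(row)]
```

A tuple such as {1: "a", 2: "a"} is dropped, because query graph nodes 1 and 2 map to the same data graph node.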

3 Optimizing Graph Queries 

The various progressive algorithms are differentiated by the way they choose
nodes to capture. At each iteration of the while-loop, a node is chosen and, as
will be shown, the choice policy may have a crucial effect on the algorithm's
performance.

7 We do not explain here how this choice is made. In fact, the various instantiations of
progressive algorithms are differentiated in the way they choose the node to capture.

8 The condition $v_s \in f(v_s)$ stands for $\bigvee_{st \in f(v_s)} (v_s = st)$.

9 Due to space limitation, the proof of correctness of the progressive algorithms is not
included.

3.1 Cost Model 

In the WWW, the dominating cost is that of communication. We are mainly
interested in measuring only the influence of choice policies on the amount of
communication. Therefore, our cost model considers only the communication
costs and not the cost of performing the local table joins required by the
algorithm (footnote 10). Furthermore, the local table joins may be optimized
independently.

10 Query optimization in relational databases is an extensively studied problem [8, 16],

In the algorithms as presented thus far, it is unclear where exactly
communication over the network is performed. In a network context (over the WWW),
the graph query schema and the progressive algorithms are interpreted in the
following way: a data graph table corresponds to an HTML page and the content
of the data graph table is the set of hypertext links in the page which point
to other pages; the query graph tables are used only for local query processing.
Therefore, the communication part of the algorithms is "hidden" in the use of
the data graph tables. In fact, line 3 of the capture-node function can be
rewritten as:

    Request($Targets_{v'}$)

    $Solutions_v = Solutions_v \cup (\{v'\} \times (Targets_{v'})^{d-1})$

Request($Targets_{v'}$) may be interpreted as "get the HTML page v' and collect
the outgoing hypertext links". Therefore, we assume the existence of a function
$Cost : V' \to N$ that models the cost of requesting a data graph table (N denotes
the integers).

Another factor affecting performance is the caching policy. If a data graph
node appears in the solutions of several query graph nodes, it is useful to put the
corresponding data graph table (i.e., the HTML page) in a cache in order not
to access it over the network more than once. Studying complex caching policies
is beyond the scope of this work. We only examine two extreme cases: no cache
and infinite cache. If there is no cache, the cost of execution is calculated by
transforming line 3 of the capture-node function to:

    Request($Targets_{v'}$)

    Cost = Cost + Cost(v')

    $Solutions_v = Solutions_v \cup (\{v'\} \times (Targets_{v'})^{d-1})$

If there is an infinite cache, the cost of execution is calculated by transforming
line 3 of the capture-node function to:

    if ($v' \notin cached$) then

        Request($Targets_{v'}$)
        $cached = cached \cup \{v'\}$

        Cost = Cost + Cost(v')

    fi

    $Solutions_v = Solutions_v \cup (\{v'\} \times (Targets_{v'})^{d-1})$

In both cases, the cost of execution is the value of the Cost variable when the
algorithm terminates. The fundamental problem of graph query optimization is
to find a capture ordering that leads to the minimum cost. We assume a serial
execution; parallel execution is beyond the scope of our study.
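
The two accounting rules can be compared side by side with a short Python sketch. The function name `request_cost`, the `cost_of` callback and the convention that `cache=None` means "no cache" are assumptions of this example, not part of the algorithms above.

```python
def request_cost(sol, cost_of, cache=None):
    """Cost of requesting the data graph tables of the nodes in sol.

    cache=None models the no-cache case; passing a set models the
    infinite cache, which is updated in place.
    """
    total = 0
    for page in sol:
        if cache is not None:
            if page in cache:
                continue            # already fetched once, no network cost
            cache.add(page)
        total += cost_of(page)
    return total
```

With unit costs, capturing two query graph nodes whose candidate sets are {p, q} and {q, r} costs 4 requests without a cache and 3 with an infinite cache, since q is fetched only once.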

In the WWW context, the value of Cost(v'), where v' is a data graph node,
may be interpreted in different ways. We may take Cost(v') to be the time
it takes to bring v' from the network. In this case, Cost(v') is unknown to the
optimization algorithms before the completion of the request operation. We may
also consider that Cost(v') is the size of the WWW page corresponding to v'
(footnote 11). In the latter case, an optimization algorithm can compute the cost
before the actual request by using the HTTP HEAD network request [7]. Our
optimization algorithms model the two possibilities.
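
For the size-based interpretation, a HEAD request can be issued with Python's standard library as sketched below. The function names are hypothetical, and the sketch assumes that the server reports a Content-Length header, which is not guaranteed; when it is missing, the size is simply reported as unknown.

```python
import urllib.request


def content_length(headers):
    """Return the Content-Length value as an int, or None if absent or malformed."""
    value = headers.get("Content-Length")
    return int(value) if value is not None and value.isdigit() else None


def page_size(url, timeout=10.0):
    """Ask for the headers of url only (HTTP HEAD) and return its reported size."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return content_length(response.headers)
```

Only `content_length` is pure; `page_size` performs a network request, so its result depends on the server being queried.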



11 Practically, the relative size of a WWW page may be used to approximate the relative
request time.
Fig. 4. A Query Graph and the Corresponding Flow Graph. Panel (b), the flow graph:
the edges are labeled with the captured nodes.

3.2 The Flow Graph 

Our main tool for studying graph query optimization, called the flow graph,
models the different possible states of query execution by considering the
possible contents of the captured set. Consider the query graph presented in
Figure 4(a) (footnote 12). The corresponding flow graph is presented in Figure 4(b)
and is interpreted as follows. The upper node (labeled with $\emptyset$) corresponds
to the empty captured set. Since the SPF is defined only on the query graph node 1,
node 1 must be captured and the next state into which every progressive algorithm
will enter ($captured = \{1\}$) corresponds to the flow graph node labeled 1. It is
then possible to capture the query graph nodes 2, 3 or 5, leading, respectively, to
the states $captured = \{1,2\}$, $captured = \{1,3\}$ or $captured = \{1,5\}$. Each
path in the flow graph, from node $\emptyset$ to node $\{1,2,3,4,5\}$, corresponds to
a possible capture ordering.

If the query graph contains n nodes, there are n! possible capture orderings.
The use of the flow graph reduces the number of tested orderings. The flow graph
considers only the possible contents of the captured set instead of the possible
orderings leading to them. The number of possibilities is reduced to $2^n$. This
technique was used to optimize joins in System-R [16]. Formally:

Definition 11. The flow graph of the graph query instance $(G,G',f)$, where
$G = (V,E)$, is the graph $G_f = (V_f, E_f)$ defined in the following way:

— The nodes of $V_f$ are the possible contents of the captured set. This is defined
inductively as:

  • $\emptyset \in V_f$

  • $\forall v \in V_f$, $\forall v_{new} \in (St \cup T(v)) \setminus v$ such that
    $T(\{v_{new}\}) \neq \emptyset$, the flow graph contains the node $v \cup \{v_{new}\}$.

— There is an edge from u to v if the captured set v is obtained by capturing a
node when captured = u, that is $\|v \setminus u\| = 1$.

12 The SPF is defined on the double-circled nodes.
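
The inductive definition translates directly into a small search over captured sets. The sketch below is illustrative only: the function name, the edge-list input format and the choice of frozensets as flow graph nodes are assumptions of the example. Because the query graph of Figure 4(a) is not reproduced here, the usage note relies on the basic query graph of the experiments section, with node 1 assumed to be the only starting point.

```python
def flow_graph(query_edges, starts):
    """Enumerate the flow graph nodes (captured sets) and edges.

    query_edges: iterable of (u, v) query graph edges
    starts     : set of query graph nodes on which the SPF is defined
    """
    targets = {}
    for u, v in query_edges:
        targets.setdefault(u, set()).add(v)
    capturable = set(targets)              # nodes with T({v}) non-empty
    root = frozenset()
    nodes, edges, frontier = {root}, set(), [root]
    while frontier:
        cap = frontier.pop()
        reachable = set(starts)
        for u in cap:
            reachable |= targets[u]
        for v_new in (reachable - cap) & capturable:
            nxt = cap | {v_new}            # capture one more node
            edges.add((cap, nxt))
            if nxt not in nodes:
                nodes.add(nxt)
                frontier.append(nxt)
    return nodes, edges
```

For the edges (1,2), (1,3), (2,4), (3,5) and starting point 1, the search produces five flow graph nodes, namely the empty set, {1}, {1,2}, {1,3} and {1,2,3}, because the sinks 4 and 5 never need to be captured.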



3.3 Optimized Progressive Algorithms 

We can build a taxonomy of the optimization algorithms along several
dimensions.

— Local vs. Global Algorithms. Local optimization algorithms do not build the
entire flow graph. Instead, a local optimization algorithm considers a node
in the flow graph, calculates the cost for each out-edge (i.e., each possible
capture) and moves along an edge with the minimal cost to a new flow
graph node. A global optimization algorithm builds the whole flow graph,
associates a cost with each flow graph edge and searches for the least cost
path from its current node to the node with captured = V.

— Offline vs. Online Algorithms. An offline algorithm does the optimization
prior to query execution while an online algorithm may change the capture
ordering during query execution. Also, an online algorithm can gather
statistical information on the part of the network being explored, in order to
improve its initial estimates.

— Rule-Based vs. Cost-Based Algorithms. Rule-based optimization algorithms
use heuristic rules for choosing the capture ordering, whereas cost-based
optimization techniques use (estimated) costs.

We now present in detail four typical optimization algorithms.

The Greedy Algorithm The Greedy algorithm chooses the cheapest node to
capture at each iteration. Greedy is a local, online, cost-based algorithm. Greedy
can be defined in two different "flavors". In the first case, it is assumed that the
cost of a data graph node cannot be computed prior to accessing it. Therefore,
Greedy chooses the query graph node whose Sol set is the smallest. The Choose
function is presented in Function 3.1. If the cost of a data graph node can be
computed prior to its capture, line 5 of the Greedy algorithm is replaced by:

    cost = 0

    for ($v' \in Sol \setminus cached$) do
        cost = cost + Cost(v')

    od

Recall that the cost function is different if the data graph tables are cached. If
there is no cache, the cost of each capture is simply the cost of the nodes in the
Sol set. If there is a cache, the cost of the capture is the cost of the nodes in Sol
that are not in the cache.
Function 3.1 Choose - The Greedy Algorithm
Input: $V = ((St \cup T(captured)) \setminus captured) \cap V_e$
Output: The next node to capture.

Method:

1 $min = \infty$

2 for ($v \in V$) do

3     if (($v \in St$) $\land$ ($S(\{v\}) \cap captured = \emptyset$))

4     then $Sol = f(v)$ else $Sol = \pi_v(S)$ fi

5     $cost = \|Sol \setminus cached\|$        (if there is no cache, $cached = \emptyset$)

6     if (cost < min) then

7         min = cost; current = v

8     fi

9 od

10 return(current)
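
Function 3.1 can be mirrored almost line by line in Python. In this sketch the data layout is assumed rather than taken from the paper: `sources` maps a query graph node to the set of nodes pointing to it, `f` is the SPF as a dictionary, and the partial solution `S` is a list of dictionaries. Ties are broken by iterating over the candidates in sorted order, which is also an arbitrary choice.

```python
def choose_greedy(candidates, starts, sources, captured, f, S, cached=frozenset()):
    """Return the candidate whose Sol set needs the fewest new requests."""
    best, best_cost = None, float("inf")
    for v in sorted(candidates):                 # sorted only to fix tie-breaking
        if v in starts and not (sources.get(v, set()) & captured):
            sol = set(f[v])                      # Sol = f(v)
        else:
            sol = {row[v] for row in S}          # Sol = column v of S
        cost = len(sol - cached)                 # no cache: cached is empty
        if cost < best_cost:
            best, best_cost = v, cost
    return best
```

Passing a non-empty `cached` set reproduces the infinite-cache flavor: pages that were already fetched no longer count toward the cost of a capture.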



The Best-Source Algorithm (BS) Greedy does not use the topology of the query
graph in order to reduce the number of accessed data graph tables. The potential
solutions for a node are in the intersection of the columns found in the Solutions
tables of the captured nodes that point to it. Therefore, it is better to choose, in
the set of candidates for capture, a node v that maximizes $\|S(\{v\}) \cap captured\|$,
since the intersection is more likely to be small and there will be fewer dangling
data graph nodes in the potential solutions for v. This optimization idea is
independent of the cost of the capture and is therefore considered to be
rule-based. The rule of thumb can be stated as: capture the node with the largest
fraction of captured source nodes.

The Global Best-Source Algorithm (GBS) GBS uses the same heuristics
as BS. But, instead of minimizing the fraction of non-captured source nodes
at each stage, GBS minimizes the sum of the fractions of non-captured source
nodes for all the nodes in an ordering. This is done in the following way. The
flow graph corresponding to the query graph of the query is constructed. Each
flow graph node corresponds to a possible content of the captured set. Each
edge (u,v) is labeled with the fraction of the non-captured source nodes when
the captured node is in $(v \setminus u)$. Global Best-Source finds the cheapest path from
node $captured = \emptyset$ to node captured = V.
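
A compact way to picture GBS is as a shortest-path computation over captured sets. The Python sketch below uses memoized recursion instead of building the flow graph explicitly; this is a substitution made for brevity, not the construction described above. Two further assumptions are labeled in the code: a node without source nodes gets label 0, and every node with outgoing edges is assumed to be reachable from a starting point.

```python
from functools import lru_cache


def gbs_ordering(query_edges, starts):
    """Return (total label, capture ordering) of a cheapest flow graph path."""
    out, src = {}, {}
    for u, v in query_edges:
        out.setdefault(u, set()).add(v)
        src.setdefault(v, set()).add(u)
    capturable = frozenset(out)                  # nodes with outgoing edges

    def label(cap, v):
        """Fraction of the source nodes of v that are not captured yet."""
        s = src.get(v, set())
        return len(s - cap) / len(s) if s else 0.0   # assumption: no sources -> 0

    @lru_cache(maxsize=None)
    def best(cap):
        if cap == capturable:
            return 0.0, ()
        reach = set(starts) | {w for u in cap for w in out[u]}
        options = []
        for v in sorted((reach - cap) & capturable):
            cost, rest = best(cap | {v})
            options.append((label(cap, v) + cost, (v,) + rest))
        return min(options)                      # assumes options is non-empty

    return best(frozenset())
```

For the query graph with edges (1,2), (1,3), (2,3), (3,4) and starting point 1, the ordering 1, 2, 3 has total label 0, whereas 1, 3, 2 pays 1/2 for capturing node 3 while its source 2 is still missing.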

The Approx Algorithm As in GBS, Approx labels the edges of the flow graph
and finds the least cost path from node $captured = \emptyset$ to node captured = V.
The difference is that each edge (u,v) is labeled with the approximated cost of
capturing the node in $(v \setminus u)$.

To compute the approximate cost, Approx operates under the following
assumptions:

— Every starting point is mapped to $a$ data graph nodes.

— The out-degree of every data graph node (i.e. the size of every data graph
table) is m.










— The data graph contains p data graph nodes (p is, for example, the
(approximated) size of the "site" against which the query is evaluated).

— Let v be a data graph node. The m edges that emanate from v are
constructed by choosing randomly m different target nodes, out of the p - 1
possible nodes, with an equal probability.

— All the pages of the site have the same size (namely 1).

In order to describe how Approx computes the approximate cost, we first
establish the following lemma.

Lemma 1. Let A be a set containing p elements. Let $A_1, \ldots, A_n$ be n sets such
that $A_i$, $i \in [1,n]$, is constructed by choosing at random and uniformly $m_i$
different elements out of A. Then:

1. The expected number of elements in $\bigcap_{i=1}^{n} A_i$ is $p \prod_{i=1}^{n} (m_i/p)$.

2. The expected number of elements in $\bigcup_{i=1}^{n} A_i$ is $p\,(1 - \prod_{i=1}^{n} (1 - m_i/p))$.
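
A brief justification of both parts can be sketched as follows, under the assumption (implicit in the statement) that the sets $A_1, \ldots, A_n$ are drawn independently of one another. A fixed element $x \in A$ belongs to $A_i$ with probability $m_i/p$, so linearity of expectation over the p elements of A gives:

```latex
\begin{align*}
E\Bigl[\bigl\|\textstyle\bigcap_{i=1}^{n} A_i\bigr\|\Bigr]
  &= \sum_{x \in A} \prod_{i=1}^{n} \Pr[x \in A_i]
   = p \prod_{i=1}^{n} \frac{m_i}{p}, \\
E\Bigl[\bigl\|\textstyle\bigcup_{i=1}^{n} A_i\bigr\|\Bigr]
  &= \sum_{x \in A} \Bigl(1 - \prod_{i=1}^{n} \Pr[x \notin A_i]\Bigr)
   = p \Bigl(1 - \prod_{i=1}^{n} \bigl(1 - \tfrac{m_i}{p}\bigr)\Bigr).
\end{align*}
```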

Therefore, given a node v, the cost of capturing v, say cost, is evaluated
(inductively) in the following way:

— If v is a starting point then cost is $a$.

— If v is not a starting point, let $v_1, \ldots, v_n$ be the captured nodes that point
to v, i.e. $\{v_1, \ldots, v_n\} = S(\{v\}) \cap captured$. Consider a node $v_i$,
$i \in [1,n]$, and assume that we know that $v_i$ is mapped to $s_i$ data graph nodes.
Each one of these data graph nodes points to m data graph nodes that are potential
solutions for v. That is, the potential solutions for v are in the union of the
$s_i$ sets of m data graph nodes each. Denote this set of potential solutions for
v induced by $v_i$ as $S_i$. From the preceding lemma (part 2), we know that the
expected number of potential solutions for v found in the $s_i$ sets, each with
m data graph nodes, is $p\,(1 - \prod_{j=1}^{s_i} (1 - m/p)) = p\,(1 - (1 - m/p)^{s_i})$.
Define $d = (1 - m/p)$. So, $E(\|S_i\|) = p\,(1 - d^{s_i})$, where $s_i$ is the number
of potential solutions for $v_i$.

For a data graph node v' to be a potential solution for v, it must belong to
$\bigcap_{i=1}^{n} S_i$ where $E(\|S_i\|) = p\,(1 - d^{s_i})$. It can be easily shown
that, from the preceding lemma (part 1), the expected number of data graph nodes
that are solutions for v is
$p \prod_{i=1}^{n} \frac{p\,(1 - d^{s_i})}{p} = p \prod_{i=1}^{n} (1 - d^{s_i})$.
Therefore, cost is $p \prod_{i=1}^{n} (1 - d^{s_i})$.

For example, consider the query graph in Figure 5 and let each starting point be
mapped to 10 data graph nodes ($a = 10$), let the out-degree of each data graph
node be 10 ($m = 10$) and let the size of the site on which the query is evaluated
be 100 ($p = 100$). The number of potential solutions for each starting point is
10. The sets $S_1$, $S_2$ and $S_3$, i.e. the potential solutions for 4 induced by 1, 2,
and 3, respectively, each contain $100\,(1 - 0.9^{10}) \approx 65$ potential solutions
for 4. The potential solutions for 4 are in the intersection of $S_1$, $S_2$ and
$S_3$. Therefore, the expected number of potential solutions for 4 (and the
approximated cost of capturing 4) is $100\,(1 - 0.9^{10})^3 \approx 27$.

Fig. 5. A Query Graph
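
The arithmetic of this example is easy to reproduce. The following Python sketch evaluates the two expressions derived above; the function names are hypothetical, and `source_sizes` stands for the values $s_1, \ldots, s_n$ of the captured nodes that point to the node being costed.

```python
from math import prod


def expected_induced(s_i, m, p):
    """Expected size of S_i: p * (1 - d**s_i) with d = 1 - m/p."""
    return p * (1 - (1 - m / p) ** s_i)


def approx_cost(source_sizes, m, p):
    """Approximated cost of capturing a node that is not a starting point."""
    d = 1 - m / p
    return p * prod(1 - d ** s for s in source_sizes)
```

With m = 10 and p = 100, `expected_induced(10, 10, 100)` evaluates to about 65.1 and `approx_cost([10, 10, 10], 10, 100)` to about 27.6, in line with the figures quoted in the example.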
4 Experiments

We built a query execution simulator to evaluate the performance of the
different optimization algorithms. The simulator enables the setting of the different
parameters and calculates the cost of executing synthetic or user-defined query
graphs on random data graphs or on abstractions of real WWW sites.

We present the results of several optimization experiments. We simulated the
execution of 30 query graphs. The query graphs are constructed as follows. The basic
query graph is G = ({1, 2, 3, 4, 5}, {(1, 2), (1, 3), (2, 4), (3, 5)}). We formed three
sets of 10 graphs each by adding n randomly chosen edges to G in each set
(n = 3, 6, 9, respectively). We have tested these query graphs on three types of
data graphs.

— Tree Data Graphs. The data graph is a tree of depth d in which the
out-degree of each node was chosen at random between min and max. Furthermore,
r random edges are added to the tree.

— Levels Data Graphs. The data graph consists of d sets of nodes (the
levels). The first level contains one node and all the other levels contain
width nodes. The maximum in-degree (id) and out-degree (od) of a node are
defined and the nodes of two consecutive levels are linked, accordingly, at
random.

— Actual sites.

Before executing a query, 10 query solutions are planted in the data graph. The
cost of each (non-cached) data table access is 1. We present the results of
simulations for three data graphs (footnote 13): $G'_1$ is a tree data graph with
d = 3, min = 2, max = 15, r = 200; $G'_2$ is a levels data graph with d = 3,
width = 50; and $G'_3$ is the graph of an actual site. We used the set of all the
pages accessible from
the URL http : / / wrn . cs . technion . ac . il/ p-f-of f . html. The organization of
this site is similar to the site described in the first example (Fig. 1). The
simulations were done (1) with an infinite cache and (2) with no cache. The results
of the simulations are shown in Figure 6. Pairwise comparisons of the algorithms
(without cache) are presented in Table 1.

Analysis From these results we can make the following observations:

— Caching is obviously very important.

— The differences in performance between the algorithms are significant, and
"wrong" capture orderings can result in very expensive query plans.

— GBS has only a slight advantage over BS. Since GBS tends to capture query
graph nodes that are "far" from starting points, it sometimes creates very
expensive orderings.

— Approx and GBS are usually better than Greedy.



13 We present the results for three data graphs; however, similar results have been
obtained when experimenting with other data graphs (created in the same way).
Fig. 6. Results of simulations. The panels plot the average number of requests for
the Greedy, BS, GBS and Approx algorithms, with and without cache, for n = 3, 6, 9:
(a) per algorithm over all data sets for $G'_1$; (b) per algorithm and per data set
for $G'_1$; (c) per algorithm over all data sets for $G'_2$; (d) per algorithm and per
data set for $G'_2$; (e) per algorithm over all data sets for $G'_3$; (f) per algorithm
and per data set for $G'_3$.
Greedy vs. GBS          G'_1                    G'_2                    G'_3
                  N     %     Diff (%)    N     %     Diff (%)    N     %     Diff (%)
Greedy better     3    10       123       6    20        14       2     6.6      17
Equal            14    46.6      -        9    30         -      15    50         -
GBS better       13    43.3     -15      15    50       -17      13    43.3     -21

Table 1. Some Pairwise Comparisons of the Algorithms. The table is read as follows.
For each data graph, pairs of algorithms are compared (A vs. B). The line "A better"
analyzes the cases in which algorithm A costs less than algorithm B. Columns "N" and
"%" contain the number, and the percentage, of query graphs on which A is better.
Column "Diff (%)" contains the difference of costs between the two algorithms. For
example, in the comparison between BS and GBS on the data graph $G'_1$, BS is better
than GBS on 2 query graphs (6.6%) and GBS is better than BS on 14 query graphs
(46.6%). When BS is better than GBS, GBS costs, on average, 143% more than BS.
When GBS is better than BS, GBS costs, on average, 28% less than BS.



— The differences between the offline algorithms and Greedy when there is
an infinite cache are very small. The reason for this may be that the offline
algorithms do not take into account the content of the cache.

— When the query graph is highly constrained (n = 9), very few data graph
nodes are "dangling" and, therefore, the differences between the algorithms
are small.

The experimentation suggests the development of online algorithms that use the 
topology of the query graph (as do BS and GBS) and use the flow graph to 
“look ahead” using approximated costs (like Approx). Adaptive algorithms, i.e. 
algorithms that improve their initial assumptions as they navigate the WWW, 
refine their knowledge, and dynamically redefine the execution plan, are worth 
checking. 

5 More Expressive Languages 

5.1 Content Conditions 

Users do not usually search for just hypertext structures. Rather, they impose
content conditions on the WWW pages they access. In order to model such
queries, we consider edge and node labeled data graphs. The query graph is
augmented with a set of content conditions. For example, a content condition
may state that some query graph node must be mapped to data graph nodes
that contain a specified string. The progressive algorithms can be adapted to
handle queries with content conditions.

The optimization algorithms presented above have natural extensions to
queries with content conditions. For example, BS may be modified as follows.
Given a candidate node for capture, say v, let a be the number of captured
nodes pointing to v. Let c be the content condition defined on v and s be the
selectivity of c (footnote 14). The constraining factor of v is defined as
$\alpha a + \beta s$ where $\alpha$ and $\beta$ are weighting factors. BS captures the
node with the largest constraining factor.

Note that (1) conditions on edges can be treated in a similar way, and (2)
the algorithms can be refined further to handle "join" conditions between query
graph nodes, e.g. "query graph nodes 1 and 2 must be mapped to HTML pages
that have the same title".



5.2 XML Objects and Semi-Structured Data

It is possible to consider the case in which the query graph nodes correspond
to (semi-structured) objects embedded in pages (encoded, for example, using
XML [4]). In this case, a WWW page may contain several query graph nodes,
and edges between query graph nodes may be local (i.e., link data graph nodes
that appear in the same page) or external (i.e., link data graph nodes that appear
in different pages).

To understand how the progressive algorithms can be used in this framework,
consider the following example. An index page points to publication pages
in which publication objects are organized on a per author basis (i.e., all the
publications of one author are contained in the same page). The publication
objects point to journal issue objects that are found in other pages which are
organized per year. We search for authors that have at least two publications in
two different journal issues in the same year. This query is shown in Figure 7
(and must be evaluated under the Distinct semantics). The dotted boxes contain
objects that must appear in the same page. The dotted arrows represent links
between pages that are induced by the links between the embedded objects. The
dotted boxes and the dotted arrows form a superstructure graph that may be
used to obtain pages from the network by using a progressive algorithm.

Obviously, the progressive algorithms must be adapted to handle queries
involving objects embedded in pages. In particular, the cost model and the
optimization algorithms must take into account the possibility of requesting several
objects in one page request. However, the global approach of the progressive
algorithms remains promising for these cases.

14 For example, if c is "v must be mapped to nodes that contain the string t", s can
be approximated by checking the number of HTML documents containing t in the
corpus of pages under consideration.
Fig. 7. Example of a graph query involving embedded objects

6 Conclusions 

We have defined the class of graph queries and a class of algorithms for solving
them. We showed how these algorithms can be optimized. Further work on this
subject includes:

— Defining new optimization algorithms. The algorithms presented here
use only one optimization technique at a time. We intend to develop
algorithms that use several techniques, i.e., algorithms that use simultaneously
the information gathered from the actual data graph (as Greedy), the
topology of the query graph (as BS and GBS) and the flow graph labeled with
approximated costs (as Approx).

— Testing other query graph and data graph topologies. We plan to
examine whether some optimization algorithms are better suited to particular
query graph or data graph topologies.

— Analyzing caching policies. We began to analyze caching policies using
standard caching algorithms (FIFO and LRU) and specialized algorithms
for graph queries.

— Simulating query graphs with content conditions. We plan to
examine how to modify the optimization algorithms in the presence of content
conditions on the query graph nodes and links.

— Defining and testing parallel query processing algorithms.
World-Wide Web browsers usually open several network connections in order to
load several World-Wide Web objects concurrently. The same technique may be
used to speed up query processing.

— Analyzing the effect of projections. Graph queries with projections are
graph queries in which the user is interested only in the solutions for a proper
subset of the query graph nodes.

— Finding appropriate query algorithms when the query graphs are
extended with regular expressions. Query languages for semi-structured
data [1] use regular expressions to specify queries on data graphs whose
organization is not completely known. We plan to extend the progressive
algorithms to handle regular expression constructs.

— Taking into account the local processing cost.

— Defining standard server services. Optimization algorithms benefit from
the availability of statistics on the data being queried. The progressive
algorithms will help in defining the information that servers should
provide in order to make a site "query friendly".
Abstract. A prototype system was developed to test the applicability of a dual-
method information-filtering model for filtering e-mail messages: content-based
filtering and sociological filtering implemented with user stereotypes. This pa- 
per reports the main results of experiments that were run to determine the ef- 
fects of combining the two methods in various ways. A major outcome of the 
experiments is that the combination of both methods yields better results than 
using each method individually. The optimal combination of the two filtering 
methods is stereotype dependent. 



1 Introduction 

Information filtering systems differ from traditional information retrieval systems in
that their users have long-term information needs that are described by means of user 
profiles, rather than ad-hoc needs that are expressed as queries. There exist two main 
filtering approaches [3], [5]: content-based filtering and sociological filtering. The 
two approaches differ in the methods used for constructing user profiles and the tech- 
niques used to calculate relevance of data items (documents). In content-based filter- 
ing, the user profile and the filtering technique are based solely on the content of in- 
formation. The user’s profile consists of a list of keywords that represent his areas of 
interest, and the filtering process is aimed at finding out to what extent the content of a 
candidate document is close to that profile. Most commercial filtering systems employ 
content-based filtering, since the method is relatively easy to implement, and produces 
reasonable results [4]. 

Sociological filtering is usually (e.g. [1] and [8]) interpreted as a collaborative 
process that bases the filtering on “similar” users. For a given user, a group of users is 
found whose feedback, recommendations, or content-based profile is most similar to 
his. The filtering consists of calculation of a document’s rank, as based on comparison 
of the user profile or the evaluated document to corresponding parameters of “similar 
users”. However, similarity of users in most systems is content-based since it is cal- 
culated on the basis of similarity of their content-based profiles or on feedback. 

We adopt a different interpretation of sociological filtering: we claim that demo- 
graphic parameters of the user, such as education, occupation and work experience, 
influence his preferences and habits in consuming or filtering information. For exam- 
ple, a researcher and a programmer may have the same area of interest (i.e. same con- 
tent-based profile), but owing to their affiliation, education and occupation, different 
documents may be relevant to them. To cope with such differences, the user profile 
must include demographic information in addition to a content-based profile. 

We assume that users who share demographic parameters also have common pref- 
erences and habits with respect to their information needs. This idea is expressed by 
means of “user-stereotypes”. A user-stereotype is a “centroid” for users who have
similar demographic background and similar information filtering behavior. This 
behavior is represented by a set of filtering rules that are common to those users. A 
user is assigned to a stereotype on the basis of commonality (closeness) of demo- 
graphic parameters, and the filtering rules of “his” stereotype are applied to determine 
the relevance rank of documents. This forms the sociological filtering method, which 
is applied in addition to content-based filtering. 

Our filtering model combines content-based and sociological filtering in various 
ways. A prototype system that implements the model was developed, to examine its 
applicability and effectiveness. The model and the system that implements it are over- 
viewed in section 2. Section 3 describes how we implement sociological filtering. 
Section 4 describes experiments conducted to test the model with various filtering 
strategies in the domain of e-mail messages. Section 5 analyses the results of the ex- 
periments, and Section 6 provides conclusions and discusses further research. 



2 The Model and the Prototype System 

The filtering model is described in greater detail in [6]; here we provide a brief over- 
view only. The model contains four databases: 

D1: Raw Database - contains incoming documents (e-mail messages) to be filtered.

D2: Represented Documents - a document is represented as a weighted-vector of 
keywords. 

D3: User Profiles - contains two types of profiles for each user: a) a content-based
profile - represented as a weighted vector of keywords, and b) a sociological profile -
includes demographic parameters of the user, e.g., education, occupation and age.

D4: Stereotype Data & Rules: contains descriptions of known user stereotypes. Each 
stereotype is represented by a set of demographic parameters that are common to 
users who “belong” to the stereotype, and by a set of filtering rules that are typical 
to those users. Each rule refers to a parameter in documents. For example, for an e- 
mail message, the parameters may include its goal (purpose), source and length. A 
rule specifies the relevancy of a document to a user who belongs to the stereotype, 
with respect to a certain parameter. For example, the rule 
If (goal = ‘conference’) then rank <- 5.9 determines that, for a certain stereo- 
type, messages announcing conferences are of high relevance (on a 0-7 scale). 

The model contains three main processes: 

F1: Representation Process - converts raw documents to a vector of weighted-terms.

F2: Filtering Process - this main process of the model calculates the relevance rank of 
each document. As said, it incorporates two main methods: 1) Content-based fil- 
tering, where a document is examined for relevancy to the user on the basis of his 
areas of interest. This is accomplished by calculating the statistical correlation 
between the vector of keywords representing the user interests and the vector of 
keywords representing the document. 2) Sociological filtering, where relevance of 
a document is calculated by applying relevant filtering-rules of the user’s stereo- 
type. Each filtering method produces a relevance rank for the examined document. 
The overall rank is some combination of the two ranks. One of our goals is to find 
combinations of both filtering methods that yield best relevance ranks. 

F3: Learning Process - Based on feedback from the user and filtering process, this 
process may update the user’s profile, his stereotype’s rules, and even re-assign the 
user to a different stereotype. (We do not elaborate on this process.) 

We examine two combinations of the filtering methods: consecutive and parallel: 

• In the consecutive combination, one of the filtering methods is considered “primary”,
i.e. more important. A document is first filtered by the primary method, and the
second filtering method is applied only if the resulting rank is above a certain
relevance threshold, providing a second relevance rank. The overall relevance rank of
the document is then a weighted average of the two ranks, with more weight given to
the primary method. If the rank produced by the primary method is below the relevance
threshold, that rank is taken as the overall rank of the document.

• In the parallel combination, each of the filtering methods is applied to every
document, providing its relevance rank. The overall rank of a document is the
average of the two ranks.
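
The two combinations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are ours, the 70% default weight is the one used in the experiments reported later, and the treatment of a rank exactly equal to the threshold (here: the second method is skipped) is an assumption, since only the "above" and "below" cases are defined.

```python
def parallel_rank(content_rank, sociological_rank):
    """Parallel combination: plain average of the two ranks."""
    return (content_rank + sociological_rank) / 2.0


def consecutive_rank(primary_rank, secondary_rank, threshold, primary_weight=0.7):
    """Consecutive combination with one method treated as primary.

    If the primary rank is not above the threshold, the secondary rank is
    ignored and the primary rank becomes the overall rank. Otherwise the
    overall rank is a weighted average that favours the primary method.
    """
    if primary_rank <= threshold:  # assumption: equality counts as "not above"
        return primary_rank
    return primary_weight * primary_rank + (1.0 - primary_weight) * secondary_rank
```

With the ranks of the worked example in Section 4 (content-based 4.27, sociological 3.29), `parallel_rank(4.27, 3.29)` gives 3.78, and `consecutive_rank(3.29, 4.27, threshold=3.5)` returns 3.29 because the sociological rank does not pass its threshold. In a real system the secondary rank would be computed only when needed; it is passed in here to keep the sketch short.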

As already stated, one of our objectives is to examine the different filtering ap- 
proaches and find out optimal strategies for different user stereotypes. We have im- 
plemented the filtering model in a prototype system, designed to enable experimenta- 
tion, i.e., evaluation and ranking of documents using different filtering strategies. For 
the purpose of this research the system was implemented to evaluate e-mail messages. 
The messages are initially evaluated by their users (recipients) who use special soft- 
ware that serves as a front-end interface to various e-mail systems: users who receive 
e-mail messages from various list servers utilize that software to evaluate and rank the 
relevancy of their incoming messages on a 1-7 scale. The user-evaluated messages are 
saved. At the experimentation stage, the system computes the relevancy of each of 
these messages several times, according to the different filtering strategies that are 
examined. The objective of the filtering system is to evaluate the relevance of documents
(messages) as closely as possible to the users’ evaluations.

The main component of the prototype system is a filtering module that implements 
the two filtering methods in various ways. The filtering module utilizes two main 
engines: the content-based engine, and the sociological engine. The prototype system 
enables the experimenter to evaluate each e-mail message according to the following 
filtering strategies: 1) Content-based filtering alone; 2) Sociological filtering alone; 3) 
Two-phase parallel combination (both content-based and sociological filtering); 4) 
Two-phase consecutive combinations, with either filtering method as the “primary” 
(i.e. content-based followed by sociological, or sociological followed by content- 
based), with different weights to each method. 



3 Implementation of Sociological Filtering 

In order to implement the model and perform experiments, we have to create user 
profiles (both content-based and sociological), form user stereotypes, and for each 
stereotype define filtering rules and sociological parameters to represent it. 

This process is based on user interviews. We need to interview users (i.e., recipi- 
ents of e-mail messages) in a certain domain, in order to identify their sociological 
parameters and information filtering rules. The environment domains for the imple- 
mentation and the following experiments are information technology departments at 
universities. The information users are academic researchers, information specialists, 
graduate students, and computer technicians. We have interviewed forty e-mail users 
from that domain who subscribe to several list-servers each. The interviews were 
based on a questionnaire consisting of two main parts: 

The first part includes questions on sociological parameters that might affect users
in their information seeking and filtering behavior. These parameters form the
basis for defining the stereotypes’ sociological parameters, and for assigning new users to
existing stereotypes. For each question/parameter we provided a set of possible answers
(values). Table 1 presents the sociological parameters, their possible values and
their numerical coding. The result of this part of the interview is the sociological
profiles of the forty users. As an outcome of this part of the interviews, a set of eight rules
with identical parameters, but with different values, was defined. Table 2 presents the
parameters of the eight rules and their meaning.

Table 1. Sociological Parameters of Users

Parameter                Possible values (and their numerical coding)
-----------------------  ----------------------------------------------------------------
Education                Ph.D. (1), M.Sc. (2), Engineer (3), B.Sc. (4), Technician (5)
Occupation               researcher (1), information specialist (2),
                         computer professional (3), student (4)
Level                    junior (1), intermediate (2), senior (3)
Computer knowledge       novice user (1), experienced user (2), professional (3),
                         computer scientist (4)
Age                      up to 25 (1), between 25-40 (2), above 40 (3)
No. of lists subscribed  up to 2 (1), between 2-7 (2), above 7 (3)
Use of e-mail            once in a couple of days (1), once a day (2),
                         several times a day (3)
Weekly use of Internet   up to 5 hours (1), between 5-10 hours (2), above 10 hours (3)
% of e-mail filtering    no filtering (1), up to 20% (2), between 20%-50% (3),
                         above 50% (4)


Table 2. Filtering Rules

Parameter     Meaning of the rule
------------  ---------------------------------------------------------
conference    Announcement of a conference
paper         Call for papers
internet      Reference to sites on the Internet
technical     Technical message
job           Job offer
fund          Announcement on funds (research grants, etc.)
length        Message whose length is above two screens
history > 2   Message topic already discussed (replied) more than twice



We used a clustering technique [2] to partition the users into stereotypes.
Clustering is a suitable data-analysis technique for stereotype formation [7] since
clusters and stereotypes share the basic idea of forming groups whose members are
similar in various parameters. The similarities among users who form a stereotype are
based on commonalities in patterns of information usage, as deduced from the values
of filtering rules assigned by the users. Each rule provides one parameter in the
similarity calculation, and its value is the numeric value for the calculation.

The next step was to determine the rules that are applicable to each of the stereotypes
and their values, where the value of a rule means the relevance rank that
will be given to an evaluated message by this rule. To determine the rules that are
applicable to a certain stereotype, the average of each rule’s values over all
users who belong to that stereotype was calculated. Only rules whose standard deviation
is below a certain threshold are selected to represent the stereotype. The justification
for this is that a low standard deviation implies uniformity of the rule value among
members of the stereotype. The value of each selected rule for a certain stereotype is the
average of the values assigned to it by the members of that stereotype. Table 3 displays
the rules that represent each stereotype and their average values. As can be seen, the
stereotypes differ in the set of rules that represent them, and in the average values of
their rules.
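
The selection step can be illustrated with a short Python sketch. The function name, the input layout and the use of the population standard deviation are assumptions made for illustration; the threshold value is not given in the text and is left as a parameter.

```python
from statistics import mean, pstdev


def stereotype_rules(member_rule_values, max_std):
    """Select the rules that represent a stereotype.

    member_rule_values maps a rule name to the list of values that the
    members of the stereotype assigned to that rule. A rule is kept only
    if the members agree on it, i.e. its standard deviation is below
    max_std; its value for the stereotype is then the members' average.
    """
    selected = {}
    for rule, values in member_rule_values.items():
        if pstdev(values) < max_std:
            selected[rule] = round(mean(values), 2)
    return selected


# Hypothetical member answers, not data from the study.
answers = {"conference": [6, 6, 5, 6], "job": [1, 7, 2, 6]}
rules = stereotype_rules(answers, max_std=1.0)
```

In this made-up example the members agree on "conference" (average 5.75) but disagree on "job", so only the former would represent the stereotype.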

Each rule is implemented by special procedures, which determine how to evaluate
and rank messages. Ranking is based on the product of two factors: the average
value of that rule for the stereotype (as shown in Table 3) and a certainty factor, which
indicates the degree to which the rule is relevant to the evaluated message. Hence, if a rule
is not relevant at all, its rank is 0; if it is 100% relevant, its rank is equal to the average
value for that stereotype (as in Table 3); otherwise, its rank equals the product
of the two factors.



Table 3. Rules of Stereotypes and their Average Values

              Average values of rules per stereotype
Rule          Stereotype 1  Stereotype 2  Stereotype 3  Stereotype 4
------------  ------------  ------------  ------------  ------------
conference    5.83          2.45          4.78          4.38
paper         5.92          1.36          2.00          1.38
Internet      4.25          -             4.89          -
technical     -             -             1.89          6.63
job           2.08          -             2.22          1.5
fund          6.5           1.82          -             1.88
length        -             -             -             1.5
history       1.92          1.73          1.44          -



To compute the certainty factor of a rule, the system looks for specific indicators
in the message that enable it to determine the pertinence of the message to the rule.
The indicators are based on the occurrences of appropriate terms in e-mail messages.
For example, to determine with 100% certainty that a message announces a conference,
at least one of the following must hold: 1) The body of the message includes
more than one occurrence of terms from a pre-defined list of “conference terms”, and
at least one occurrence of a date. 2) The subject header of the message includes at
least one occurrence of a “conference term”. Partial fulfillment of the conditions
results in lower certainty.
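
A rough Python sketch of such a certainty procedure for the conference rule follows. The two full-certainty conditions are the ones stated above; everything else is hypothetical: the term list, the date pattern, and in particular the value 0.5 returned for partial fulfillment, which the text does not specify.

```python
import re

# Hypothetical vocabulary and date pattern; the actual lists are not given.
CONFERENCE_TERMS = ("conference", "workshop", "symposium")
DATE_PATTERN = re.compile(r"\b\d{1,2}[./-]\d{1,2}[./-]\d{2,4}\b")


def count_terms(text, terms):
    """Count occurrences of any of the terms in the text, ignoring case."""
    lowered = text.lower()
    return sum(lowered.count(term) for term in terms)


def conference_certainty(subject, body, partial=0.5):
    """Return a certainty factor in [0, 1] for the conference rule."""
    body_hits = count_terms(body, CONFERENCE_TERMS)
    has_date = DATE_PATTERN.search(body) is not None
    # Condition 2: a conference term in the subject header.
    if count_terms(subject, CONFERENCE_TERMS) >= 1:
        return 1.0
    # Condition 1: more than one conference term in the body, plus a date.
    if body_hits > 1 and has_date:
        return 1.0
    # Partial fulfillment (assumed value): some evidence in the body only.
    if body_hits >= 1:
        return partial
    return 0.0
```

Because `str.count` matches substrings, "conferences" also counts as a hit, which loosely imitates matching on a stem.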

We need to define the sociological parameters and values that represent each 
stereotype, so as to enable the assignment of new users to the right stereotypes. To 
accomplish this, the frequency of values of each of the sociological parameters is 
calculated for users who belong to a given stereotype. Only parameters whose fre- 
quency is above a certain threshold are considered as representative of a stereotype. 
The most common value of the parameter is selected to represent that parameter in 
that stereotype. For example, if the most common value of education of stereotype 1 is 
“technician”, then “technician” represents this stereotype on that parameter. Table 4 
presents the sociological parameters and their values for each stereotype. 
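
As an illustration, the selection of representative values might look as follows in Python. We assume here that the "frequency" of a parameter is the share of stereotype members who have its most common value; the function name, the data layout and the example threshold are ours.

```python
from collections import Counter


def representative_values(member_profiles, min_share):
    """Pick the sociological values that represent a stereotype.

    member_profiles is a list of {parameter: coded value} dictionaries,
    one per member. A parameter is kept only if its most common value is
    shared by more than min_share of the members.
    """
    representatives = {}
    parameters = {p for profile in member_profiles for p in profile}
    for parameter in sorted(parameters):
        counts = Counter(
            profile[parameter] for profile in member_profiles if parameter in profile
        )
        value, count = counts.most_common(1)[0]
        if count / len(member_profiles) > min_share:
            representatives[parameter] = value
    return representatives


# Hypothetical members, coded as in Table 1.
members = [
    {"education": 1, "age": 3},
    {"education": 1, "age": 2},
    {"education": 1, "age": 1},
    {"education": 2, "age": 3},
]
profile = representative_values(members, min_share=0.6)
```

Here three of four members hold a Ph.D. (code 1), so education represents the stereotype, while no age group is common enough to be included.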



Table 4. Representing Sociological Values for the Stereotypes

Parameter               Stereotype 1    Stereotype 2                Stereotype 3              Stereotype 4
----------------------  --------------  --------------------------  ------------------------  ----------------
Education               1 (Ph.D.)       2 (M.Sc.)                   -                         -
Occupation              1 (researcher)  2 (information specialist)  4 (student)               [illegible]
Level                   -               [illegible]                 [illegible]               3 (senior)
Computer knowledge      -               2 (experienced user)        -                         3 (professional)
Age                     3 (above 40)    [illegible]                 [illegible]               [illegible]
No. of lists            2 (2-7 lists)   -                           [illegible]               2 (2-7 lists)
Use of e-mail           2 (once a day)  2 (once a day)              3 (more than once a day)  [illegible]
Weekly use of Internet  [illegible]     [illegible]                 [illegible]               [illegible]
% of e-mail filtering   3 (20%-50%)     [illegible]                 1 (no filtering)          3 (20%-50%)



4 Experimentation

We conducted a series of experiments with the prototype system, aimed at examining 
the applicability of the model, especially the applicability of sociological filtering 
integrated with stereotypes, the impact of using dual-method filtering (i.e., various 
combinations of sociological and content-based filtering), and the optimal filtering 
strategy for different user stereotypes. The experiments involved ten users from the 
same domain that was used to define the stereotypes and filtering rules. 

To enable content-based filtering, we had to prepare a content-based profile of
each of the ten participants, describing his areas of interest. Each participant received 
a proposed list of terms generated from several dozen of his incoming e-mail mes- 
sages. The list included the most frequently occurring terms in those messages. (It was 
prepared with the aid of special software that extracts terms from messages, employ- 
ing look-up tables and a stop-list, and counts the frequency of meaningful terms). Each 
participant was asked to review the proposed list of terms, add or drop terms, and 
weigh each term for its degree of interest to him, using a 0-100 scale. 
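
The term-proposal step can be approximated by the following Python sketch. It is not the software used in the study: the stop-list is a tiny hypothetical one, the look-up tables mentioned above are omitted, and the function name is ours.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-list; a real one would be much longer.
STOP_WORDS = frozenset({"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"})


def frequent_terms(messages, top_n=20):
    """Return the top_n most frequent meaningful terms with their counts."""
    counts = Counter()
    for text in messages:
        for token in re.findall(r"[a-z]+", text.lower()):
            if token not in STOP_WORDS and len(token) > 2:
                counts[token] += 1
    return counts.most_common(top_n)


proposed = frequent_terms(
    ["The database of the workshop", "Workshop on database systems and database design"],
    top_n=2,
)
```

The resulting list would then be shown to the participant, who edits it and assigns each term a weight between 0 and 100.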

To enable sociological filtering, we had to prepare a sociological profile of each of 
the ten participant users, and then assign each of them to an appropriate stereotype. 
The sociological profile of each participant was created with a questionnaire, in the same
way as for the forty original users. The assignment procedure that
relates a user to a stereotype calculates the Euclidean distance between the user’s
sociological profile and the vectors of sociological parameters that represent each of
the existing stereotypes (see Table 4). The stereotype whose vector of sociological
parameters is closest to the user’s profile is chosen to be “his” stereotype.
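
A minimal Python sketch of this assignment is given below. How parameters that do not represent a stereotype (the "-" entries of Table 4) enter the distance is not described, so the sketch simply skips them; this is an assumption, and it tends to favour stereotypes with few representative parameters. The stereotype vectors use only clearly legible entries of Table 4, and the user profile is invented.

```python
from math import sqrt


def distance(user_profile, stereotype_vector):
    """Euclidean distance over the parameters that represent the stereotype."""
    shared = [p for p in stereotype_vector if p in user_profile]
    if not shared:
        return float("inf")
    return sqrt(sum((user_profile[p] - stereotype_vector[p]) ** 2 for p in shared))


def assign_stereotype(user_profile, stereotypes):
    """Return the name of the stereotype closest to the user's profile."""
    return min(stereotypes, key=lambda name: distance(user_profile, stereotypes[name]))


# Partial stereotype vectors, coded as in Table 1.
stereotypes = {
    "stereotype 1": {"education": 1, "occupation": 1, "lists": 2, "email": 2, "filtering": 3},
    "stereotype 2": {"education": 2, "occupation": 2, "computer": 2, "email": 2},
}
# Hypothetical new user.
user = {"education": 1, "occupation": 1, "computer": 4, "lists": 2, "email": 3, "filtering": 3}
chosen = assign_stereotype(user, stereotypes)
```

For this user the distance to stereotype 1 is 1.0 and the distance to stereotype 2 is the square root of 7, so stereotype 1 is chosen.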

For the actual experiments with the various filtering strategies, each of the ten par- 
ticipants evaluated the relevancy of about 200 e-mail messages that came in from list- 
servers dealing with professional matters. This was done with the aid of front-end 
software that was developed for this purpose. The users used a 1-7 scale to rank the 
relevancy of each message. Once a user evaluated a message, its rank, along with the 
message and the user identification, were saved to a special file. The same messages 
were evaluated later by the filtering system; each message was evaluated several times 
- each time using a different filtering strategy. The output of these runs was a file for
each participant, containing, for each message, the user’s evaluation (rank) and
the ranks produced by each of the filtering strategies.

The following is an example of the evaluation of an e-mail message received
by one of the users. The message announced an opening for a post-doctoral fellowship.
That user (who was assigned to stereotype 1 - researchers - on the basis of his
sociological profile) ranked the relevancy of this message as 3, possibly because he
was not looking for a job. His partial interest in the message may be a result of the
information it provides on job openings in his working area.

In content-based filtering, the system evaluates this message by computing the
correlation between the weighted terms in the user’s content-based profile and the
frequency of those terms in the message, relative to the message length. (The system
identifies and counts the frequency of meaningful terms, as described earlier for the
construction of content-based profiles.) For this message the relevance rank is 4.27.
(The correlation is actually 0.61 on a -1 to 1 scale; 4.27 is its transformation onto a 1-7
scale.) This rank is high compared to the user’s rank, because the message contains
terms that appear in the user’s profile, but - as indicated by the user - the message is
not so relevant to him (rank 3). This example shows that content-based filtering alone
may not be sufficient, and may even be misleading.
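
The computation can be sketched in Python as follows. We assume a Pearson correlation, and we assume that the transformation to a rank is a multiplication by 7: the text does not state the formula, but the quoted figures (0.61 and 4.27) are consistent with it. The function names and the example profile are ours.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def content_rank(profile, term_counts, message_length, scale=7.0):
    """Rank a message against a content-based profile.

    profile maps terms to interest weights (0-100); term_counts maps terms
    to their number of occurrences in the message. Frequencies are taken
    relative to the message length before correlating.
    """
    terms = sorted(profile)
    weights = [profile[t] for t in terms]
    rel_freq = [term_counts.get(t, 0) / max(message_length, 1) for t in terms]
    return scale * pearson(weights, rel_freq)  # assumed transformation


# Hypothetical profile and message.
rank = content_rank({"agents": 100, "www": 50, "sql": 0}, {"agents": 4, "www": 2}, 100)
```

In this invented case the term frequencies are exactly proportional to the profile weights, so the correlation is 1 and the rank is 7.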

In sociological filtering, the system evaluates this message by applying the six 
filtering rules of stereotype 1 (that user’s stereotype). Each of these rules may provide 
one rank/value, and the overall sociological rank for the message is the average of the 
applicable rules’ ranks. The rules of stereotype 1 yielded the following ranks: 

• Conference = 0: the system found no evidence that the message is about a conference.

• Paper = 0: the system found no evidence that the message is about a call for papers.

• Internet = 0: the message does not include an Internet address.

• Job = 2.08: the system found that the message is definitely about a job opening, and
therefore rated the message according to the value of this rule for stereotype 1.

• Fund = 4.5: the system found that the message is about funding with 70% certainty,
which is multiplied by the value of the rule, 6.5. (The certainty is 70% because the stem
“fund” appears in the message body, but not in the subject header.)

• History = 0: the message does not refer to earlier messages with the same subject
received by the user (i.e. no “reply” with the same subject in the message header).

Thus, the overall rank of this message, according to the sociological filtering
method, is the average of the applicable rules’ ranks, i.e., (2.08 + 4.5)/2 = 3.29. In this
example, the sociological filtering rank is closer to the user’s evaluation, because this
message is more closely related to the user’s sociological profile (and hence to the filtering
rules that apply to his stereotype) than to his content-based profile.
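
The calculation for this message can be retraced with a small Python sketch. The function names are ours, and we take "applicable" to mean rules with a non-zero rank, which matches the example. Note that the exact product 0.7 × 6.5 is 4.55, which the text gives as 4.5, so the sketch yields about 3.32 rather than the quoted 3.29.

```python
def rule_rank(average_value, certainty):
    """Rank produced by one rule: stereotype value times certainty factor."""
    return average_value * certainty


def sociological_rank(rule_values, certainties):
    """Average the ranks of the rules that apply to the message."""
    ranks = [
        rule_rank(value, certainties.get(rule, 0.0))
        for rule, value in rule_values.items()
    ]
    applicable = [r for r in ranks if r > 0]
    if not applicable:
        return 0.0
    return sum(applicable) / len(applicable)


# Rule values of stereotype 1 (Table 3) and the certainties of the example.
stereotype_1 = {
    "conference": 5.83, "paper": 5.92, "internet": 4.25,
    "job": 2.08, "fund": 6.5, "history": 1.92,
}
message_certainties = {"job": 1.0, "fund": 0.7}
rank = sociological_rank(stereotype_1, message_certainties)
```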

The rank of this message according to the parallel filtering strategy is calculated
as the average of the content-based and sociological ranks: (4.27 + 3.29)/2 = 3.78, which is
better (i.e. closer to the user’s evaluation) than content-based filtering, but worse than
sociological filtering. The result for the consecutive filtering strategy where content-based
filtering is the primary method, with weight 70%, followed by sociological
filtering is: 0.7 × 4.27 + 0.3 × 3.29 = 3.976. For the consecutive filtering strategy where
sociological filtering is the primary method the rank is 3.29, equal to sociological filtering
alone. This is because the relevance threshold for sociological filtering is 3.5;
since the rank obtained by sociological filtering is lower than the threshold, it becomes
the overall rank of the message.

In conclusion, for this particular example, the best strategy is sociological filtering,
matched by the consecutive strategy where sociological filtering is the primary method
(which yields the same rank), followed by the parallel filtering strategy. Content-based
filtering is the worst strategy for this case. Of course, for different users (in the same or different stereotypes) and
different messages, different results may be obtained, meaning that different filtering 
strategies may be more appropriate. 

The overall goal was to find out which filtering strategy generates evaluations that
are most highly correlated with the user evaluations, and whether any filtering strategy is
consistently more effective than the others for different user stereotypes. The comparison of
filtering strategies was done within stereotypes because the experiments are meant to 
examine the effect of sociological filtering as integrated with stereotypes. To do so, 
for every filtering strategy, a vector that includes the ranks of messages given by users 
that belong to a certain stereotype was correlated with the system-produced ranks for 
the same messages. 

5 Analysis of Results 

The main results of the experiments are summarized in Table 5. It is divided into four
main sections, each referring to one stereotype. Each row shows the correlation between
the system’s rankings of the messages (Y) and the users’ rankings (X) for a different
filtering strategy. In the two-phase consecutive strategies the weights
of the methods are 70% for the primary and 30% for the secondary. r(X,Y) is the
correlation coefficient; r² is the coefficient of determination; t is the statistic of the
significance test; and p is the level of significance of the correlation (p < .05 is considered
significant). N is the number of messages. The results are interpreted as follows: the
higher the correlation between the rankings of a filtering strategy and those of the users
belonging to a given stereotype, the better the strategy, because the system’s evaluation of
messages is similar to the users’ evaluation of the same messages.

Table 5. Correlation Results for the Four Stereotypes

Filtering Strategy            r(X,Y)   r²       t         p
----------------------------  -------  -------  --------  -------

Correlation Results for Stereotype 1 - Researchers; N=429
Content-based                 .5753*   .3309*   14.533*   .0000*
Sociological                  .4767*   .2273*   11.207*   .0000*
Parallel                      .5875*   .3451*   15.002*   .0000*
Content-based + Sociological  .6066*   .3680*   15.766*   .0000*
Sociological + Content-based  .5160*   .2262*   12.447*   .0000*

Correlation Results for Stereotype 2 - Information Specialists; N=469
Content-based                 .5048*   .2548*   12.636*   .0000*
Sociological                  .4828*   .2331*   11.914*   .0000*
Parallel                      .6387*   .4079*   17.938*   .0000*
Content-based + Sociological  .4420*   .1953*   10.647*   .0000*
Sociological + Content-based  .0646    .0042    1.399     .1624

Correlation Results for Stereotype 3 - Students; N=179
Content-based                 .4686*   .2194*   7.053*    .0000*
Sociological                  .4342*   .1885*   6.412*    .0000*
Parallel                      .5884*   .3462*   9.682*    .0000*
Content-based + Sociological  .6153*   .3786*   10.385*   .0000*
Sociological + Content-based  .5056*   .2556*   7.797*    .0000*

Correlation Results for Stereotype 4 - Technical Staff; N=350
Content-based                 .4062*   .1650*   8.292*    .0000*
Sociological                  .6521*   .4252*   16.045*   .0000*
Parallel                      .7051*   .4971*   18.548*   .0000*
Content-based + Sociological  .5301*   .2810*   11.663*   .0000*
Sociological + Content-based  .6598*   .4353*   16.378*   .0000*
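
The relation between the columns of Table 5 can be checked with a few lines of Python. The sketch assumes the usual t test for a correlation coefficient with N - 2 degrees of freedom; the test is not named in the text, but the tabulated t values agree with this formula to within rounding. The function names are ours, and the p value is not computed here.

```python
from math import sqrt


def determination(r):
    """Coefficient of determination for a correlation coefficient r."""
    return r * r


def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * sqrt(n - 2) / sqrt(1.0 - r * r)


# Content-based filtering for stereotype 1: r = .5753, N = 429.
t_content = t_statistic(0.5753, 429)
```

For the row used above the formula gives a t value of about 14.53, in line with the 14.533 reported in the table.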



Based on Table 5, here are our main observations on the results per stereotype:

• Stereotype 1: Content-based filtering alone is better than sociological filtering 
alone, but parallel filtering is better than either method alone. The best strategy is 
consecutive filtering with content-based as primary method. The worst strategy is 
sociological filtering alone, while second to worst would be consecutive filtering 
with sociological as primary. At any rate, sociological filtering improves perform- 
ance when applied in parallel with or following content-based filtering. 

• Stereotype 2: Parallel filtering yields the best results, followed by content-based
filtering alone and then sociological filtering alone. Surprisingly, the results for the
two consecutive methods are lower than for each method alone. (Particularly low
are the results when sociological filtering is the primary method.)

• Stereotype 3: Here again, the consecutive strategy with content-based filtering as 
primary method yields the best results, followed by parallel filtering. Then comes 
consecutive with sociological filtering as primary method. Last again is sociological
filtering alone, and second to last is content-based filtering alone.

• Stereotype 4: Again, parallel filtering is best, but second best is consecutive with 
sociological as primary method. Following that is sociological filtering alone. The 
worst strategy for this stereotype is content-based filtering alone and second to 
worst is consecutive with content-based filtering as primary method. 

These results are summarized in Table 6, which shows the strategies within each
stereotype in descending order of correlation with the users’ evaluations.

Table 6. Order of filtering strategies within stereotypes

Order  Stereotype 1                  Stereotype 2                  Stereotype 3                  Stereotype 4
-----  ----------------------------  ----------------------------  ----------------------------  ----------------------------
1      Content-based + Sociological  Parallel                      Content-based + Sociological  Parallel
2      Parallel                      Content-based                 Parallel                      Sociological + Content-based
3      Content-based                 Sociological                  Sociological + Content-based  Sociological
4      Sociological + Content-based  Content-based + Sociological  Content-based                 Content-based + Sociological
5      Sociological                  Sociological + Content-based  Sociological                  Content-based

The following observations on the results concern all stereotypes:



1. For all stereotypes, all of the system’s evaluations (except one) correlate significantly
with the users’ evaluations. This suggests that all filtering methods yield effective
results that match the users’ needs to a certain extent. (The significant correlation
coefficients range from 0.406 to 0.705.)

2. For most stereotypes (1, 2 and 3), content-based filtering alone provides higher 
correlation than sociological filtering alone. Hence, sociological filtering cannot
substitute for content-based filtering, which is based on the contents of documents.

3. For most stereotypes (1, 2 and 3), the consecutive strategy where content-based is 
primary filtering method provides higher correlation coefficients than the consecu- 
tive strategy where sociological is primary method. 

4. In no case are content-based or sociological filtering alone better than some combi- 
nation of content-based and sociological (either consecutive or parallel). 

5. For two stereotypes (2 and 4), parallel filtering turned out to be best, and for the
other two (1 and 3) - second best. Consequently, in all cases parallel filtering is better
than content-based alone or sociological alone.

6. Most of the two-phase filtering results have higher correlation coefficients than the
corresponding one-phase filtering method. Here, parallel filtering is compared to each
of the filtering methods, while each consecutive combination is compared to the
filtering method that was considered “primary” in that run.

In conclusion, for every stereotype there is at least one significantly better result 
when the strategy is to combine the two filtering methods in some way. So, dual- 
method filtering is definitely vital. However, there is no single strategy that is “best” 
in all cases; the best strategy is stereotype-dependent. In other words, for every 
stereotype the best strategy needs to be discovered. This can be done by means of 
experimentation or after gaining experience with the system. 



6 Conclusions 

We implemented sociological filtering by means of rules attached to user stereotypes. 
We showed that sociological filtering, even as a single filtering method, is significantly
correlated with user evaluations, implying that it can be used to predict the relevance
of documents. However, we found that in most cases content-based filtering alone is 
more correlated with user evaluations than sociological filtering alone. This should not 
be taken as a surprise; after all, content-based filtering is about the content of docu- 
ments, which is obviously a major criterion for relevance. 

We clearly found that for every stereotype, there exists at least one combination
of the two filtering methods (consecutive or parallel) that is better than either filtering
method alone. The conclusion is that dual-method filtering is better than any
single-method filtering.

The experiments show that there is no single “best” strategy for combining content-based
and sociological filtering. For different stereotypes there may be different
“best” combinations: parallel, content-based followed by sociological, or sociological 
followed by content-based. Hence, the optimal filtering strategy may be considered as 
a stereotype characteristic, which can be inferred by experimentation and experience. 

Our experiments did show the applicability of the dual-method filtering model, including
the integration of sociological filtering with stereotypes. Furthermore, the
consistent results, which are based on about 200 messages per participant,
support the correctness of the approach. However, because of the small number of participants,
we cannot claim external validity for the results. For that, more experiments,
encompassing more users and involving a variety of application domains, are
needed. We plan to implement the model in other domains, such as digital libraries,
and test the various filtering strategies with more users. Other domains may enable us 
to identify different user stereotypes and more filtering rules. 

The filtering model includes a learning process, but we have addressed it only partially.
In the future, we plan to extend the learning process, so that it will become
possible to detect new rules or change existing rules on the basis of user feedback, as
well as changes in the user population and their information needs.

In the current model, a user is assigned to one stereotype only. In further research 
we plan to extend this to enable assignment of a user to multiple stereotypes. This may 
be problematic, because there may be conflicting rules or rule values in each of those 
stereotypes; appropriate methods to resolve such conflicts must be investigated. 

Another issue for further research is the effectiveness of stereotype-based sociological
filtering. In the current model sociological filtering is determined according to
filtering rules attached to the user’s stereotype. An alternative approach could be to have
personal filtering rules attached to each user. Such rules may be defined in a way similar to
the way the user’s content-based profile is defined. The research issue is to compare the
two alternative approaches. It may turn out that a combination of the two approaches
is desirable, so that at the beginning (i.e. for a new user) sociological filtering
will be based on a stereotype and its filtering rules. Later on, as the user gains experience,
the system may assist him in defining his personal filtering rules, which will be
used instead of the stereotype’s rules.



Abstract. In this article we describe a method for refining an initial set
of search results, obtained through a meta-search engine, based on 
relevance feedback. The idea is to interactively obtain from the user a 
subset of relevant documents in an ongoing query, thereby providing a 
sample of the related vocabulary. Terms acquired in this way are 
combined with the terms initially in the query, in order to improve 
retrieval precision. In our method the user is also asked to select a 
subset of irrelevant documents, so that terms may be combined 
negatively in the query. A model of compatible architectures, in which 
the method can be implemented, is presented. An instance of such 
model, the system Web Query Reformulator (WQR), is described, with 
some of its performance results. 



1 Introduction 

The World Wide Web has become a global source of information in all areas of 
human interest, ranging from commerce to science. Nevertheless, the potential for 
exchange and sharing is not yet matched by the ability to actually access and retrieve 
specific information of interest to the user. Keyword searching is a particularly 
challenging task, due to the enormous volume of information that can be returned. In 
general, the bulk of matching results is completely irrelevant. This situation gets 
worse by the week and consequently, improving the ratio of the number of hits to the
number of retrievals is at the top of the agenda for any search tool.

Keyword searching has been implemented in library information systems for 
decades and as a result there is a fair amount of ground covered by the information 
retrieval community on the subject. Although the conventional library techniques are
ready to use in the context of Web searching, there are two major factors that
distinguish the Web from the usual centralised library information system. First, the
scale of the Web is unique. Because of this size discrepancy, the application of
such early techniques can be profitable but only if coupled with additional 
mechanisms for filtering and ranking the search results. Secondly, the Web is a totally 
distributed collection of multimedia documents, which is not completely indexed by 
any one of the available spiders’ databases. The fact that these indexing systems cover 
overlapping portions of the Web motivates the concept of meta-search engine, which 
queries a number of underlying search engines in parallel. 

In this article we describe a method for refining an initial set of search results,
obtained through a meta-search engine, based on relevance feedback [1]. The 
principle of this approach to the refinement of search results is asking the user to 
select a subset of documents, which are relevant to an ongoing query, thereby 
providing a sample of the related vocabulary. Terms are acquired and ordered with 
the use of standard methods of information retrieval, such as stemming, term co- 
occurrence, term length and document frequency. The resulting terms are combined 
with the terms initially in the query, in order to improve retrieval precision. In our 
method the user is also asked to select a subset of irrelevant documents, so that terms 
may be combined negatively in the query. 

Although our method is completely general, for a number of reasons related to the 
dynamics of the interactive querying process and to performance issues, a suitable 
underlying search system must have a few desirable properties. In particular, the 
systems we have in mind comply with a client-server architecture that includes 
features such as storage structures to support learning and history log and availability 
of computing power on the client side to cope with expensive string processing tasks. 

The paper is organised as follows. In section 2, the application of information
retrieval techniques to Web searching is discussed and the main features of publicly
available search engines are presented. In section 3 we present our method for
refining search results inside a “query session” and the general architecture of the
meta-search engine in which our query expansion technique can be incorporated,
together with the most closely related work. In section 4 we present preliminary test
results and in section 5 we draw our conclusions.



2 Web Searching 

Most of the techniques used for indexing and searching in the Web derive directly 
from the legacy of traditional Information Retrieval. Although much research has 
been focusing on knowledge-intensive approaches to Information Retrieval, indexing 
and searching in the Web still relies heavily on very basic techniques, going back 
over 20 years, such as document frequency/inverse document frequency, stop words 
and stemming. But the Web poses new and tough challenges to these techniques, due 
to its sheer volume, great diversity and accelerated volatility. 

One novel aspect of indexing in the Web is spamming, and all the heuristics that 
have to be built in to try to neutralise it. The specifics of the Web also have effects 
over stop word lists, which have to be dynamic in order to adjust to the changing 
nature of the collection. Despite increasing efforts to define structural standards, such 
as the use of metadata, retrieval on the Web relies heavily on the traditional vector 
and probabilistic representation models. Term weighting is widely used, which takes 
into account the location of the term inside the document (title, use of keywords like
“summary”, “conclusion”, etc.) or ad-hoc heuristics like capitalisation, bold faces, etc.

In such a massive volume of documents, it is especially important to contextualise 
the user's information need. One of the strategies that are being very successfully used 
for that purpose is query expansion, particularly when it has the benefit of the user's 
relevance feedback. Query expansion is thus a form of adding terms to the original 
query presented by the user, which is typically very short, so that the augmented
query can be better compared to documents in the proper context. The selection of
additional terms may be based on a few documents that the user assessed as adequate
or on other resources such as thesauri, domain databases, etc. When using
relevance feedback the user may or may not be directly involved in the term selection 
process. Some systems will automatically select the “best” terms out of the “relevant” 
documents. 

Attempts have been made to use query expansion techniques in Web search
engines, but the results are less than impressive. In particular, relevance feedback
implementations such as “more like this” are very ineffective.



2.1 Search Engines 

Ordinarily, the Web user employs one or more search tools as a starting point to
access Web pages of interest. These tools (AltaVista, Yahoo, etc.) have different
degrees of user-friendliness, effectiveness and range. They offer a variety of search
facilities in the user interface, in the actual query language, and sometimes in the
structural organisation exhibited in directories.

Pro-active search engines have their own spiders or robots, which periodically traverse
the Web's hypertext structure, to build and maintain their own massive database of 
indexes. These databases turn out to be quite different from one another because the 
robots use different strategies: what documents to consider, what links to follow, what 
words to index, what weights to use for each word, etc. 

Subject trees offer a browsing service, where the user can follow subject trees or 
ontological directories to find information. Typically these catalogues are built with 
human selection of documents and judgement of which subject the document belongs 
to, as well as links to pages that offer wide coverage of that subject and contain a 
useful number of further links. An example of a subject tree is Yahoo. Because 
subject trees rely on humans for their overall design and maintenance, they typically 
provide links to a smaller number of documents than a keyword query made against a 
spider-automated index. 

Meta-search engines, instead of keeping their own indexes, route the query in 
parallel to various other search engines. The answer pages of each of those different 
search engines are collected and subsequently the result is consolidated into a list with 
new rankings and a uniform presentation. 

In MetaCrawler [2] and Fusion [3] the user chooses from a list of search engines,
with the option “all”, and the aggregation is done by normalising the score provided
by each search engine. ProFusion [4] and SavvySearch [5], on the other hand, select
the engines that they will use based on the query and some other heuristics.

Relevancy Rankings - Most of the search engines return results with confidence
or relevancy rankings. The majority use mainly term frequency to determine whether a
document is relevant, possibly taking the position of keywords in the document
into account. Another common criterion is whether the document is frequently
linked to from other documents on the Web.

3 Refining Search Results

In order to describe our approach to refining search results, first we describe a typical 
query session, the processes involved and the assumptions we make. Secondly we 
present the architecture of a system capable of hosting such a query session. 



3.1 The Query Expansion Method 

A typical query made by the user of a retrieval system is made up of very few terms. 
From these terms, an initial set of documents is fetched from various index systems, 
each contributing with a specific number of documents. Typically, there is a 
reasonable amount of duplicates. Some are easily recognisable as such (same URL, 
for instance) but most have deceptively distinct descriptions. The more of these
duplicates are eliminated, the wider the range of search results. Also, as will become
clear, fewer duplicates make our method more sound.

From the list of entries for the retrieved documents (titles, abstracts and other bits of
information) the user may select a subset of relevant ones and another subset of
irrelevant ones. Following this assessment, the full versions of the selected documents
are fetched. Each document selected is individually analysed. The terms are clustered
by stem and their frequency of occurrence is computed. After going through all the 
documents two global data structures are created, one related to the documents 
assessed as relevant and the other to the documents assessed as irrelevant. The 
structures contain, for each stem, its document frequency (number of documents 
containing the stem) and total frequency (total number of occurrences of the stem in 
the retrieved collection). 

The terms to be positively added to the query are amongst those extracted from the 
relevant documents, and are ranked in order of document frequency, length (number 
of component stems) and total frequency. This is a variation of the algorithm 
proposed in [6]. Conversely, the terms to be negatively added are amongst those 
extracted from the irrelevant documents and are ranked in order of total frequency 
provided they do not occur in any of the relevant documents. 
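
The ranking just described can be sketched in Python as follows. This is only an illustration under stated assumptions: documents are given as lists of word tokens, `stem` is a crude hypothetical stand-in for a real stemmer, and the function names are ours rather than part of WQR.

```python
from collections import Counter

def stem(word):
    """Crude stand-in for a real stemmer (assumption): lowercase, drop a final 's'."""
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def term_statistics(documents):
    """Return (document frequency, total frequency) counters keyed by stem."""
    doc_freq, total_freq = Counter(), Counter()
    for tokens in documents:
        stems = [stem(t) for t in tokens]
        total_freq.update(stems)        # every occurrence counts
        doc_freq.update(set(stems))     # each document counts once per stem
    return doc_freq, total_freq

def rank_terms(relevant, irrelevant):
    """Rank candidate positive and negative terms as described in the text."""
    rel_df, rel_tf = term_statistics(relevant)
    _, irr_tf = term_statistics(irrelevant)
    # Positive terms: document frequency, then length in stems, then total frequency.
    positive = sorted(
        rel_df, key=lambda s: (-rel_df[s], -len(s.split()), -rel_tf[s], s)
    )
    # Negative terms: total frequency, excluding stems seen in any relevant document.
    negative = sorted(
        (s for s in irr_tf if s not in rel_df), key=lambda s: (-irr_tf[s], s)
    )
    return positive, negative
```

Ties are broken alphabetically here only to make the output deterministic; the text does not prescribe a tie-breaking rule.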

At this point we have two possible routes to follow: 1. the automatic construction 
of the new query with the highest scoring terms; 2. involving the user’s judgement 
with respect to which terms to add, positively or negatively, to the query. There are 
arguments for both approaches and we leave the choice open. 

The expanded query is broadcast again to the available indexing systems and 
another cycle of interaction takes place. 
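
For the automatic route, the construction of the expanded query might look like the sketch below. The `+`/`-` prefix syntax, the quoting of phrases and the `limit` parameter are assumptions made for illustration; actual search engines differ in their query languages.

```python
def expand_query(original_terms, positive, negative, limit=3):
    """Append the best new positive terms ('+') and negative terms ('-') to a query."""
    def quote(term):
        # Multi-word terms are sent as quoted phrases (assumed engine syntax).
        return f'"{term}"' if " " in term else term

    parts = [quote(t) for t in original_terms]
    new_positive = [t for t in positive if t not in original_terms][:limit]
    parts += ["+" + quote(t) for t in new_positive]
    parts += ["-" + quote(t) for t in negative[:limit]]
    return " ".join(parts)
```

In the interactive route, the lists passed as `positive` and `negative` would simply be the terms confirmed by the user instead of the top-ranked ones.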



3.2 Compatible Architectures 

The basic architecture of meta-search engines is shown in fig. 1. An interface receives
the user’s input of keywords and possibly some search options. These specifications 
generate URLs, which are broadcast to a list of search engines. Upon receiving the 
answer pages, the fusion of the results takes place and an answer page is produced. 

Fig. 1. meta-search engine architecture

This architecture does not support, as ours does, the concept of a query session,
which is a slice of time and computer resources totally dedicated to one particular 
user with a specific information need. A more suitable architecture is one that 
complies with a client-server model as shown in fig. 2. 



[Figure with two panels: CLIENT and SERVER]

Fig. 2. client-server based architecture



On the client side, the basic functional modules are described as follows. 

The Control module is responsible for dispatching the query to the reference 
servers and managing the local database, as well as controlling all the interactions 
with the user. 

The Aggregator module performs the consolidation of the results, which basically
involves re-ranking and identifying duplicates originated by different reference
servers. The output of this module is the list of ranked documents, with normalised
scores. In case the same document is returned by more than one reference server, its
ranking reflects the addition of individual scores in order to reward the agreement
between search tools. The reference server itself filters duplicates from the same
search tool.
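
A minimal sketch of this consolidation step is given below. It assumes that each reference server returns `(url, raw_score)` pairs, that scores are normalised by dividing by the server's best score, and that duplicates are recognised by identical URL only; none of these details is fixed by the description above.

```python
from collections import defaultdict

def aggregate(results_by_server):
    """Merge per-server result lists into one ranked list of (url, score) pairs.

    results_by_server maps a server name to a list of (url, raw_score) pairs.
    """
    combined = defaultdict(float)
    for results in results_by_server.values():
        if not results:
            continue
        top = max(score for _, score in results)
        for url, score in results:
            # Normalise to [0, 1] per server, then add scores across servers
            # so that agreement between search tools is rewarded.
            combined[url] += score / top if top > 0 else 0.0
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))
```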

Identifying duplicate results may be crucial in obtaining a sound list of results.
There are a number of methods for performing such a task, of which we found the
Dice proximity measure, computed over the terms of the title, abstract and URL, to be
a particularly cost-effective one.
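
The Dice coefficient of two term sets A and B is 2|A ∩ B| / (|A| + |B|). A small sketch of duplicate detection along these lines follows; the tokenisation rule, the dictionary layout of a result entry and the threshold of 0.8 are illustrative assumptions, not values taken from WQR.

```python
import re

def terms(text):
    """Split text into a set of lowercase alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def dice(a, b):
    """Dice coefficient of two sets: 2|A & B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

def is_duplicate(entry_a, entry_b, threshold=0.8):
    """Compare two result entries over the terms of title, abstract and URL."""
    fields = ("title", "abstract", "url")
    a = set().union(*(terms(entry_a[f]) for f in fields))
    b = set().union(*(terms(entry_b[f]) for f in fields))
    return dice(a, b) >= threshold
```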

The Document Analyser processes the documents that were singled out by the
user, extracting the terms of the HTML document, grouping them by stem and
computing their frequency of occurrence. Phrases may be taken into account,
regardless of their number of terms. The resulting global term vector is kept by the
Control module and is then processed by the Term Selection module.

The main responsibilities of the Term Selection module are: 1) to eliminate
redundancies, i.e. terms already accounted for within phrases; 2) to rank the
remaining terms based on their document frequency, length of phrase and frequency
inside each document. Beyond this, there are a number of highly profitable
techniques that can be incorporated here, mainly related to previous interactions,
user profile and domain knowledge. The term selection process may also take into
account all the documents assessed as relevant within a query session.

Although not essential, we strongly recommend the maintenance of a database 
containing a broad range of historic information, which can be used as a training set 
in machine learning for filtering responses and other relevancy measures. 

On the server side, the reference servers are responsible for the exchanges
between the meta-search system and the Web search mechanisms with their own
databases. Each search mechanism has a corresponding reference server, which has
knowledge of the specific details of that particular mechanism. One of the parameters
passed to the server is the desired number of results. It is up to the reference
server to guarantee that the requested number of distinct results is returned, i.e., if a
duplicate is identified another document is requested. When a query is provided to the
system on the client side, that same query is dispatched to all the configured reference
servers. The server also receives the time constraint for the retrieval operation.

In summary, the reference servers’ role is twofold: 

• Processing the query, by translating and forwarding it to the corresponding
Web search mechanism;

• Parsing the results’ page(s), extracting structured fields, such as title, abstract and 
URL. 

3.2.1 WQR 

Based on the architecture blueprinted above, we implemented WQR - Web Query 
Reformulator. In particular, some important characteristics of the WQR are: 

• it is a Java application with heavy usage of multi-threading; 

• there is no limit to the number of words (stems) that can be combined into a 
phrase. Any sequence of terms can be considered a phrase, provided it occurs more 
than once and starts and ends with terms that are not stop words. Imposing a limit 
produces very artificial constructs. For example, limiting phrases to three terms [7]
would partition “Fourth Annual International World Wide Web Conference” into
“Fourth Annual International”, “Annual International World”, “International
World Wide”, “World Wide Web” and “Wide Web Conference”.

• duplicate documents are eliminated with the Dice similarity measure. 
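
The phrase rule in the list above (any sequence of terms that occurs more than once and neither starts nor ends with a stop word) can be sketched as follows. WQR itself is a Java application; this Python fragment is only an illustration, with a tiny hypothetical stop word list and without the stemming step.

```python
from collections import Counter

# Tiny illustrative stop word list (assumption); a real list would be much larger.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}

def candidate_phrases(documents, min_count=2):
    """Return {phrase: count} for word sequences of unbounded length.

    A sequence qualifies if it has at least two words, starts and ends with
    a non-stop word, and occurs at least `min_count` times overall.
    """
    counts = Counter()
    for tokens in documents:
        words = [t.lower() for t in tokens]
        for start in range(len(words)):
            if words[start] in STOP_WORDS:
                continue
            for end in range(start + 2, len(words) + 1):
                if words[end - 1] not in STOP_WORDS:
                    counts[" ".join(words[start:end])] += 1
    return {p: c for p, c in counts.items() if c >= min_count}
```

Enumerating all subsequences is quadratic in document length, which is one reason why the architecture places this processing on the client side.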

The concept of using query reformulation with relevance feedback in Web 
searching was originally proposed by Smeaton [3] in the Fusion system. The basic 
differences between WQR and Fusion are: 




56 Claudia Oliveira et al. 



• in Fusion, most of the processing is performed at the server side. In WQR, the
client side carries the heaviest burden, so that a much larger number of users can be
accommodated with limited processing power;

• Fusion only takes into account documents assessed as relevant; 

• WQR indicates the operators that should be applied to the suggested terms; 

• WQR provides suggestions of phrases; 

• WQR can be naturally expanded to make use of learning mechanisms based on 
user relevance assessments. 



4 Analysis of Search Results 

In order to assess the potential benefits of WQR we adopted the following procedure. 

• A specific information need was selected: conferences with call for papers, in the 
area of textual information retrieval for the Web. 

• The following query was formulated: +"call for papers" "information retrieval"
web www "world wide web".

• The query was executed in 8 other search mechanisms available as well as in 
WQR, using all possible result enhancing features available. 

• We computed the precision of the first twenty references presented by each system. 
They were classified as relevant, irrelevant or somewhere in between - documents 
that were not precisely what we wanted but had links to relevant documents. 

• We performed two sets of tests with our system, one with automatic query 
expansion and the other with user interaction. 
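
The precision figure used in this procedure can be computed as in the sketch below. The label names and the `count_partial` switch are our own; the judgements in the usage note are invented purely to show the call and are not the experimental data.

```python
def precision_at(judgements, k=20, count_partial=True):
    """Precision over the first k judged references.

    judgements is a list of labels: "relevant", "partial" (not precisely what
    was wanted but linking to relevant documents) or "irrelevant".
    """
    top = judgements[:k]
    if not top:
        return 0.0
    accepted = {"relevant", "partial"} if count_partial else {"relevant"}
    return sum(1 for label in top if label in accepted) / len(top)
```

For instance, a hypothetical list with 5 relevant, 3 partial and 12 irrelevant references gives 8/20 when in-between documents are counted and 5/20 when they are not.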

Fig. 3 depicts the best results obtained in a single session for each system. For 
each search mechanism there is a vertical bar divided into 3 sections. The height of 
the black bottom section indicates the percentage of relevant documents retrieved by 
the search mechanism; the grey middle section indicates the not quite irrelevant 
documents; the white top section indicates the percentage of irrelevant documents. 
From left to right, the vertical bars correspond to MetaCrawler, SavvySearch, 
ProFusion, Fusion2, Infoseek, Excite, Hotbot, Altavista and WQR. 

The results of a single session show that our system had the highest precision, 
considering relevant and average documents (12 among 20 using interactive query 
expansion). This result was obtained after two interactions with the system, which 
resulted in 8/20 and 10/20 ratios, respectively. After the third interaction the precision 
deteriorated, going back to a 10/20 ratio. In our experiment, interactive query 
expansion had better performance than automatic query expansion. 

Although the experiment had a limited scope we noticed some important points. 

• Most of the other search engines returned to the user a significant number of 
duplicate references to the same conference. 

• For this particular query none of the enhancement features offered by the different 
search engines was effective. 

We feel that more experimentation is needed to effectively assess the benefits that 
our system is capable of providing, but we found these preliminary results very 
encouraging. 

Fig. 3. precision of search mechanism



5 Conclusions 

We propose a method of expanding user queries in the Web, using relevance 
feedback. Based on the assessment made by the user of the relevancy of documents 
presented to him as a result of his original query, we produce a ranked list of terms 
that may be added to the query, preceded or not by operators (+, -). 

Experiments indicated that the proposed method enhanced the performance of the 
search session as perceived by the user. 

One way to further improve the precision of our method is to apply 
filtering/categorisation techniques to the intermediate results, in order that documents 
that are clearly irrelevant can be moved down or excluded from the list. Another 
promising development is to provide mechanisms for creating profiles based on 
typical queries and relevance judgements of an individual user and use them to 
contextualise both the processing of the query and the filtering/categorisation 
mechanisms. 

Abstract. One of the main characteristics of object-oriented database
management systems is the explicit representation of relationships
between objects. A simple example of a query addressing these relationships
arises if we assume the object types Company and Division with
the relationship has-division from Company to Division. In this case a
query might ask for the companies which have a division called “strategy”.
The query might start with the companies and navigate to the
divisions which can be reached via the has-division relationship. Finally
the query has to check if the name attribute of the Division object is
“strategy”. Since there is no direct condition for the companies in the
query, this query execution will be costly. If we assume that there is a
reverse relationship division-of from Division to Company, an alternative
execution plan might start with the “strategy” divisions and follow
this reverse relationship. In this case an index structure for the name
attribute of the Division objects can be exploited to speed up query
processing.

In the present paper we describe a query optimizer which exploits this potential
invertibility of navigational operations in queries. Our approach is
based on, but not limited to, the context of the ISO and ECMA standard
PCTE and P-OQL.

1 Introduction 

In contrast to relational database management systems, object-oriented database
management systems (ooDBMS) allow for the explicit representation of the
relationships between the maintained objects. The various ooDBMS differ mostly
in the expressive power of their modeling facilities for these relationships. Some
systems allow only special attributes which consist of a set of objects. Other
systems provide the means of a link to represent a relationship between objects.
Furthermore some systems provide different link categories to represent
different types of relationships, or they allow for the application of key- and/or
non-key-attributes to the relationships.

Such relationships are often addressed in queries to the database. Assume
for example an object base with the object types Student, Course and Lecturer
and the relationships attends from Student to Course and held-by from Course
to Lecturer. This schema is illustrated in figure 1.

Fig. 1. Simple example schema with relationships

On this object base we could, for example, ask for the names of all students who
attend a course held by lecturer “Smith”, together with the corresponding course
names. The straightforward way to evaluate this query would be to address all
Student objects stored in the database and to check for each student whether there is
a path consisting of an attends and a held-by link referencing a lecturer named
“Smith”. Under the natural assumption that there are far fewer lecturers than
courses, and far fewer courses than students, this is obviously an extremely costly
operation. If, on the other hand, there are reverse relationships for the relationships
attends and held-by (as in our example schema), another query
processing plan might start with the lecturers. In this case, only for those lecturers
with the name “Smith” would the students attending a course held by the lecturer have
to be addressed. Especially where an index for the name
attribute of the lecturers exists, this query processing plan would be much more
efficient than the first one.
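
The difference between the two plans can be made concrete with a toy object base. The Python sketch below is purely illustrative: the data, the reverse relationship names `holds` and `attended_by`, and the use of a dictionary lookup as a stand-in for an index on the lecturer name are all assumptions, and the counter only approximates the number of objects touched.

```python
from collections import defaultdict

# Toy object base (hypothetical data): forward relationships.
attends = {"ann": ["db"], "bob": ["db", "ai"], "cy": ["ai"], "dee": ["os"]}
held_by = {"db": "Smith", "ai": "Jones", "os": "Jones"}

# Reverse relationships, derived from the forward ones.
holds = defaultdict(list)
for course, lecturer in held_by.items():
    holds[lecturer].append(course)
attended_by = defaultdict(list)
for student, courses in attends.items():
    for course in courses:
        attended_by[course].append(student)

def forward_plan(name):
    """Scan all students and follow attends, then held_by."""
    visited, result = 0, []
    for student, courses in attends.items():
        visited += 1
        for course in courses:
            visited += 1
            if held_by[course] == name:
                result.append((student, course))
    return sorted(result), visited

def inverted_plan(name):
    """Start at the lecturer (index lookup) and follow the reverse links."""
    visited, result = 1, []
    for course in holds.get(name, []):
        visited += 1
        for student in attended_by[course]:
            visited += 1
            result.append((student, course))
    return sorted(result), visited
```

Both plans return the same (student, course) pairs for “Smith”, but on this data the forward plan touches 9 objects and the inverted plan only 4.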

To exploit this potential invertibility of navigational operations in queries 
the query optimizer should consider query execution plans based on such in- 
verted navigational operations in addition to the conventional query optimiza- 
tion techniques. To this end, the query optimizer has to determine invertible 
path expressions and to select the most beneficial inversion. Thereby the query 
optimizer should consider the available index structures and select the structure 
(or structures) which should be applied. 

We have developed such a query optimizer for the P-OQL query language. 
P-OQL [8,9] is an OQL-oriented query language for the object management 
system of PCTE [19], which in turn is the ISO and ECMA standard for an 
open repository [17, 18]. The environment consisting of PCTE and P-OQL is ex- 
tremely challenging for the sketched optimizer facility, because the data model 
of PCTE contains extremely powerful facilities for the representation of rela- 
tionships. The relationships can have key and non-key attributes, and there are 
different categories of relationships. The query language P-OQL allows navigational
operations to be defined by regular path expressions addressing the attributes
and the categories of the relationships and providing various iteration facilities.

In the following we first describe the essential concepts of PCTE and P-OQL
(cf. section 2). Thereafter we give various examples of the inversion of regular
path expressions in P-OQL, which might allow for a more efficient query
processing. In section 4 we present the architecture of our query optimizer. Section 5
deals with the cost estimation techniques employed to choose from the different
possible query formulations and in section 6 we present some experimental
results. Finally section 7 gives a short discussion of related approaches and section
8 concludes the paper.



2 Example Environment 
2.1 PCTE 

As mentioned, PCTE (Portable Common Tool Environment) is the ISO and
ECMA standard for a public tool interface (PTI) for an open repository [17-19].
As one of its major components PCTE contains a structurally object-oriented 
object management system (OMS) designed to meet the special requirements of 
software engineering environments. 

The data model of PCTE can be seen as an extension of the binary Entity- 
Relationship Model. The object base contains objects and relationships. Rela- 
tionships are normally bi-directional. Each relationship is realized by a pair of 
directed links, which are reverse links of each other. The type of an object is 
given by its name, a set of applied attribute types, and a set of allowed outgoing 
link types. New object types are defined by inheritance. 

A link type is given by a name, an ordered set of attribute types called key 
attributes, a set of (non-key) attribute types, a set of allowed destination object 
types, and a category. PCTE offers five link categories: composition (defining 
the destination object as a component of the origin object), existence (keep- 
ing the destination object in existence), reference (assuring referential integrity 
and representing a property of the origin object), implicit (assuring referential 
integrity) and designation (without referential integrity). 

Throughout this paper we will use the schema given in figure 2 as the basis 
for our examples. It consists of the object types Student, Course, Employee, 
Thesis and Project. The attribute types applied to each object type are given in 
the ovals at the upper left corner of the rectangle representing the object type. 

The link types are indicated by arrows. A double arrowhead at the end of a
link indicates that the link has cardinality many. Links with cardinality many
must have a key attribute. In the example the numeric attribute no and the
string attribute problem are used for this purpose. For example the link type
attends from Student to Course has such a key attribute and is hence described
as “no.attends”. Therefore an instance of this link type can be addressed by its
link name, which consists of the concrete value for the key attribute and the type
name separated by a dot, e.g. “3.attends”. Link types with cardinality one do
not need a key attribute and are given in the schema by a dot followed by the link
type name. An ‘E’ or ‘R’ in the triangles at the center of the line representing
a pair of links indicates that the link has category existence or reference.

Finally the schema contains the link type has-advisor as an example for a 
link type with a non-key attribute. In our case this is the attribute meeting. 

With respect to the query optimizer presented in this paper some further
characteristics of the PCTE data model have to be stressed: (1) The reverse link
type of a link type is unambiguous, i.e. each link type has exactly one reverse
link type and a link type can be the reverse link type for only one link type
(except for designation links). (2) A link type may have multiple destination
object types which may or may not be leaves of the inheritance hierarchy. (3)
Since links with category designation are an exception to the rule that each link
has a reverse link, a navigation traversing a designation link cannot be inverted.
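
Characteristics (1) and (3) suggest a simple invertibility test for a sequence of link types. The Python sketch below is an illustration only: the `LinkType` record, the reverse link names and the categories assigned in the usage note are hypothetical and do not reproduce the PCTE API or the schema of figure 2.

```python
from dataclasses import dataclass
from typing import Optional

# The five PCTE link categories named in the text.
CATEGORIES = {"composition", "existence", "reference", "implicit", "designation"}

@dataclass(frozen=True)
class LinkType:
    name: str
    category: str
    reverse: Optional[str] = None  # None for designation links

def invert_path(path, schema):
    """Return the inverted list of link type names, or None if not invertible.

    path is a list of link type names; schema maps names to LinkType records.
    """
    inverted = []
    for name in reversed(path):
        link = schema[name]
        if link.category == "designation" or link.reverse is None:
            return None  # a designation link has no reverse link
        inverted.append(link.reverse)
    return inverted
```

With a schema in which attends is reversed by a link called attended_by and held_by by a link called holds, the path attends/held_by inverts to holds/attended_by, whereas any path containing a designation link yields None.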

2.2 P-OQL 

P-OQL [8] is an OQL-oriented query language for PCTE, and OQL (Object
Query Language) [2] is the ODMG proposal for a query language for object-
oriented database management systems. The main differences between P-OQL 
and standard OQL are due to the adaptation to the data model of PCTE. Hence, 
especially the treatment of links is specific to P-OQL. 

A query in P-OQL is either a select-statement, or the application of an op- 
erator (like sum). 

Assume that we search for pairs with the name of a student and the name of 
an employee, where the student attends a course held by the employee and the 
student is in the fourth year. The following query in P-OQL yields these pairs: 

select A:name, B:name
from A in Student, B in (A:_.attends/.held_by/->.)
where A:year = 4

In the from-clause of this query two base sets are defined: base set A,
addressing all students in the object base, and base set B, addressing all objects
which can be reached from the actual object of base set A via a path matching
the regular path expression “A:_.attends/.held_by/->.”. In this path expression
the prefix “A:” means that the actual element of base set A is used as the starting
point of the definition, and “_.attends/.held_by” means that exactly one link of
type attends and one link of type held_by must be traversed. The underscore “_”
is used as a wildcard for numerical key attributes, denoting that arbitrary
key values are allowed. In addition, intervals can be specified for numerical key
attributes and regular expressions can be used for string key attributes. Since
there is no key attribute defined for the link type held_by, the notation “.held_by”
is used to specify that a held_by link has to be traversed. The notation “/->”
is used in the regular path expression “A:_.attends/.held_by/->.” to address the
destination object of the path. In addition, a corresponding notation can be used to
address the last link of the path. The dot “.” at the end of the regular path expression
means that the object under concern is addressed. It is also possible to address an
attribute or a tuple of values (see [8] for more details).

Due to the definition of an independent base set (base set A) and a dependent 
base set (base set B), each student is combined with each employee, which can 
be reached from the object representing the student via a path matching the 
regular path expression. 

Altogether, a base set can be defined in P-OQL in five different ways: (1)
giving an object type name or an object type name suffixed by “*”, meaning
that objects of all subtypes are addressed as well; (2) using a link type name
to address all links of a given type; (3) defining a set of objects or links using a 
regular path expression, as with base set B in our example; (4) defining a set of 
objects or links via a sub-select; or (5) passing a set of objects or links via the 
API (application programming interface) when submitting a query. 

In the where-clause of our example query the considered combinations of
students and employees are restricted to those for which the year attribute of the
student has the value “4”. Besides such simple conditions, P-OQL also allows,
for example, the use of quantifiers and subqueries in the where-clause.

The select-clause of the example query states that a multiset of pairs is 
requested. Each pair consists of the name of the student and the name of an 
employee who holds the course the student attends. 
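The evaluation of this query can be pictured as a nested loop over the independent base set A and the dependent base set B. The following Python sketch illustrates this reading on a tiny in-memory object base. The sample objects, the tuple layout of the links and the helper name `follow` are assumptions made for illustration only; they are not part of P-OQL or PCTE.

```python
from collections import Counter

# Hypothetical toy object base: every link is (source, link_type, key, destination).
# The key is None for link types without a key attribute (such as held_by).
students = {"s1": {"name": "Ann", "year": 4}, "s2": {"name": "Bob", "year": 2}}
employees = {"e1": {"name": "Coltrane"}, "e2": {"name": "Brubeck"}}
links = [
    ("s1", "attends", 1, "c1"),
    ("s1", "attends", 2, "c2"),
    ("s2", "attends", 1, "c1"),
    ("c1", "held_by", None, "e1"),
    ("c2", "held_by", None, "e1"),
]


def follow(obj, link_type):
    """Return the destination of every outgoing link of the given type (any key)."""
    return [dst for src, typ, _key, dst in links if src == obj and typ == link_type]


def query():
    """Nested-loop reading of:
    select A:name, B:name
    from A in Student, B in (A:_.attends/.held_by/->.)
    where A:year = 4
    """
    result = Counter()  # a multiset of (student name, employee name) pairs
    for a, a_attrs in students.items():          # independent base set A
        for course in follow(a, "attends"):      # one attends link, any key value
            for b in follow(course, "held_by"):  # one held_by link
                if a_attrs["year"] == 4:         # where-clause
                    result[(a_attrs["name"], employees[b]["name"])] += 1
    return result


pairs = query()
```

In this toy data the student in the fourth year attends two courses held by the same employee, so the pair occurs twice in the resulting multiset.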

An additional facility of P-OQL — which is relevant to our optimization 
problem — allows for the iteration over one or more links of the same type. 
Assume for example that we are interested in the subprojects of a project with 
subject “Workflow” . These subprojects can be determined in P-OQL as follows: 

select A:subject, A:[_.has_subproject]+/->subject
from A in Project
where A:subject = “Workflow”

Here the regular path expression “A:[_.has_subproject]+/->subject” is used
to create a (multi-)set with the subjects of all subprojects of the project
addressed in base set A. The meaning of “[_.has_subproject]+” is that one or more
links matching the link definition “_.has_subproject” have to be traversed.
Alternatively, P-OQL offers the iteration facilities “[path-definition]*” to
indicate that zero or more paths matching path-definition have to be traversed and
“[path-definition]” to indicate that a path matching path-definition is optional.
After traversing an arbitrary number of has_subproject links, the subject
attribute of the destination object is addressed using “/->subject”.
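The iteration facilities can be read as a traversal that collects one destination per matching path. The following Python sketch shows this reading for the “+” and “*” forms. The project names, the dictionary layout and the function names are assumptions chosen for illustration, and the sketch further assumes that the has_subproject links form an acyclic hierarchy.

```python
# Hypothetical has_subproject links: project -> list of direct subprojects.
has_subproject = {
    "workflow": ["modelling", "engine"],
    "engine": ["scheduler"],
}
subject = {
    "workflow": "Workflow",
    "modelling": "Modelling",
    "engine": "Engine",
    "scheduler": "Scheduler",
}


def one_or_more(start, links):
    """Destinations of all paths with one or more links, one entry per path.

    Assumes the link graph is acyclic; otherwise the recursion would not end.
    """
    reached = []
    for child in links.get(start, []):
        reached.append(child)                      # path of length one
        reached.extend(one_or_more(child, links))  # longer paths through child
    return reached


def zero_or_more(start, links):
    """Like one_or_more, but the empty path (the start object itself) counts too."""
    return [start] + one_or_more(start, links)


# Corresponds to A:[_.has_subproject]+/->subject for the project "workflow".
subprojects = [subject[p] for p in one_or_more("workflow", has_subproject)]
```

The only difference between the two helpers is whether the start object itself is part of the result, which mirrors the difference between “one or more” and “zero or more” traversed links.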

In addition to the link definitions used in the above example, which have
been based on a given link type name and a definition of the allowed values
for the key attributes, P-OQL allows the specification of a set of link categories
instead, meaning that all links having one of the given categories fulfill this
link definition. E.g. the expression “[{c,e}]+/->.” addresses all objects which
can be reached via a path consisting only of links with category composition
or existence. Furthermore, not only the category of the link itself, but also the
category of its reverse link can be specified using “@” as a prefix for the category.

Finally it has to be mentioned that the ODMG standard OQL also permits 
the definition of reverse links and the use of path expressions for the definition 
of base sets. However, since the ODMG data model does not include attributes 
for links or link categories, the path expressions in OQL can be seen as a special 
case of the regular path expressions in P-OQL. 

3 Inverting Regular Path Expressions 

3.1 Simple Cases 

First let us consider a slightly extended version of the example query given in 
section 2.2. In addition we require that the address of the employee is “Cologne” : 

select A:name, B:name
from A in Student, B in (A:_.attends/.held_by/->.)
where A:year = 4 and B:address = “Cologne”

In the first base set this query addresses all students, and in the second base
set the corresponding employees are addressed, starting from the student currently
under consideration, via the regular path expression “A:_.attends/.held_by/->.”.
In this way each student is combined with each employee who can be reached via a
path matching the path expression. Furthermore, due to the semantics of P-OQL,
this means that when a student attends two courses held by the same employee,
this student/employee pair is considered twice. However, the same result can be
achieved when the query addresses all employees and inverts the regular
path expression “A:_.attends/.held_by/->.” in order to address the corresponding
students in a second base set. In this way we obtain the equivalent P-OQL query:

select A:name, B:name
from B in Employee, A in (B:_.holds/_.attended_by/->.)
where A:year = 4 and B:address = “Cologne”

To see that both queries are equivalent, first we have to note that the data
model of PCTE assures that for each path matching the regular path expression
“_.attends/.held_by/->.” there is a reverse path matching the regular path
expression “_.holds/_.attended_by/->.”. Hence, if an employee e is reached from
a student s via a path matching “_.attends/.held_by/->.”, there is always a
reverse path from e to s matching “_.holds/_.attended_by/->.”. If there are multiple
paths from a student s to an employee e, we have exactly the same number of
reverse paths. As a consequence, we have the same pairs in the base sets of both
queries and each pair has exactly the same number of occurrences.
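The equivalence argument can be checked on a small example. The Python sketch below builds the reverse links by swapping the two ends of each forward link, which stands in for the PCTE guarantee that every link has a reverse link. The object identifiers and the helper name `two_step_paths` are hypothetical.

```python
from collections import Counter

# Hypothetical forward links as (source, destination) pairs.
attends = [("s1", "c1"), ("s1", "c2"), ("s2", "c1")]
held_by = [("c1", "e1"), ("c2", "e1")]

# One reverse link per forward link, modelled by swapping the two ends.
attended_by = [(dst, src) for src, dst in attends]
holds = [(dst, src) for src, dst in held_by]


def two_step_paths(first, second):
    """Multiset of (start, end) pairs, one occurrence per two-link path."""
    return Counter((a, c) for a, b in first for b2, c in second if b == b2)


forward = two_step_paths(attends, held_by)     # student -> employee
backward = two_step_paths(holds, attended_by)  # employee -> student

# Swap the pair components so that both multisets use (student, employee).
backward_as_pairs = Counter({(s, e): n for (e, s), n in backward.items()})
```

Both multisets contain the pair (s1, e1) twice and the pair (s2, e1) once, so the base sets of the two query formulations agree in pairs and in multiplicities.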

If there is an index for the attribute address of the object type Employee 
which is more selective than the index for the attribute year of the object type 
Student, the second query will lead to a more efficient query execution plan. 

The above example might suggest that an inversion of a regular path ex- 
pression used to define a base set can always be performed easily without any 
effects on the select-clause and the where-clause of the query. Unfortunately this
is not true. The situation becomes much more complicated whenever one of the 
following cases occurs: (1) The values of the key attributes are restricted in the 
regular path expression. (2) One of the iteration facilities of P-OQL is used. (3) 
The destination object type of a link is ambiguous. (4) A link definition in the 
regular path expression is stated by the allowed link categories. In the following 
sections we describe how these more complicated cases can be handled. 

3.2 Conditions for Link Key Attributes 

As mentioned in section 2.2 there are different facilities to restrict the allowed 
key attribute values for traversed links. In the following query we use two of 
these facilities in order to navigate from the student addressed in base set A to 
some specific projects. Furthermore we require that the student should be in the 
fourth year and that the duration of the project should be at least 36 months:

select A:name, B:subject
from A in Student, B in (A:[2..4].attends/.held_by/1.works_for/->.)
where A:year = 4 and B:duration >= 36

Due to the conditions for the link key attributes which are integrated
in the regular path expression, we cannot invert the path expression
“A:[2..4].attends/.held_by/1.works_for/->.” in one step. Rather we have to split
up the expression when inverting it, in order to address the key attributes of
the reverse links, for which the conditions have to be enforced. As a consequence,
the conditions which are integrated in the original regular path
expression are moved into the where-clause of the inverted query:

select A:name, B:subject
from B in Project, H1 in (B:*.under_dev_by->.),
     H2 in (H1:/_.holds/_.attended_by->.), A in (H2:/.)
where A:year = 4 and B:duration >= 36
      and H1:@no = 1 and H2:@no >= 2 and H2:@no <= 4

In the inverted query we first navigate from the project under consideration
to all outgoing under_dev_by links (see footnote 1). These links are addressed in
the auxiliary base set H1. In the where-clause we have to assure that the key
attribute value of the reverse link of the link addressed in base set H1 is “1”.
In P-OQL the “@”-sign can be used to switch from a link to its reverse link.
Hence, the condition “H1:@no = 1” assures that the value of the key attribute of
the reverse link is “1”. Starting with base set H1 we then navigate to the
attended_by links which are addressed in base set H2. The first “/” in the regular
path expression “H1:/_.holds/_.attended_by->.” used for this purpose specifies that
we navigate from the current link in base set H1 to its destination object.
Starting at this destination object we then traverse a holds link and reach the
attended_by link. For the reverse links of these attended_by links the condition
“H2:@no >= 2 and H2:@no <= 4” assures that the key attribute value is in the
interval [2..4]. Finally the base set definition “A in (H2:/.)” addresses the
destination objects of these links in base set A.
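The role of the reverse-link keys can be made concrete with a small Python sketch of the inverted navigation. It is only an illustration under simplifying assumptions: the `Link` record, its field names and the sample data are hypothetical, and the key of the reverse link (what “@no” reads) is simply stored next to each link.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    src: str
    typ: str
    key: object      # key attribute value of this link
    dst: str
    rev_key: object  # key attribute value of its reverse link (None if it has no key)


# Hypothetical data for the inverted direction project -> employee -> course -> student.
links = [
    Link("p1", "under_dev_by", "x", "e1", 1),  # reverse of e1 -1.works_for-> p1
    Link("p1", "under_dev_by", "y", "e2", 2),  # reverse of e2 -2.works_for-> p1
    Link("e1", "holds", 1, "c1", None),        # reverse of c1 -.held_by-> e1
    Link("e2", "holds", 1, "c2", None),
    Link("c1", "attended_by", 1, "s1", 3),     # reverse of s1 -3.attends-> c1
    Link("c1", "attended_by", 2, "s2", 7),     # reverse of s2 -7.attends-> c1
    Link("c2", "attended_by", 1, "s1", 2),
]


def out(obj, typ):
    """All outgoing links of the given type, regardless of their key value."""
    return [link for link in links if link.src == obj and link.typ == typ]


def inverted(project):
    """Students reached from a project, with the key conditions checked on reverse links."""
    found = []
    for h1 in out(project, "under_dev_by"):       # H1 in (B:*.under_dev_by->.)
        if h1.rev_key != 1:                       # H1:@no = 1
            continue
        for hold in out(h1.dst, "holds"):         # H1:/_.holds/...
            for h2 in out(hold.dst, "attended_by"):
                if 2 <= h2.rev_key <= 4:          # H2:@no >= 2 and H2:@no <= 4
                    found.append(h2.dst)          # A in (H2:/.)
    return found


students_for_p1 = inverted("p1")
```

Only the student whose attends link has a key in [2..4] and who is reached through an employee with works_for key 1 survives both conditions.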

3.3 Iteration Facilities 

For each employee living in Cologne the following example query calculates a set
with the subjects of the projects with a duration of at least 36 months that he
or she works for. To this end the query iteratively follows the subproject_of links:

select A:name, B:subject
from A in Employee, B in (A:_.works_for/[_.subproject_of]*/->.)
where A:address = “Cologne” and B:duration >= 36

The inversion of this regular path expression does not cause major problems,
since we can use analogous iteration facilities in the inverted version:

select A:name, B:subject
from B in Project, A in (B:[_.has_subproject]*/*.under_dev_by/->.)
where A:address = “Cologne” and B:duration >= 36

This alternative query formulation exploits that for each path p1 matching
the expression “_.works_for/[_.subproject_of]*/->.” there is a reverse path p2
matching the expression “[_.has_subproject]*/*.under_dev_by/->.”. Unfortunately
the situation becomes much more complicated as soon as the iteration facilities
are combined with conditions for the key attribute values.

3.4 Iteration Facilities and Conditions on Link Keys

The following query directly corresponds to the query considered in section 3.3
except that we traverse only the first link of type subproject_of for each project,
that is, we traverse only links with the key attribute value “1”:

select A:name, B:subject
from A in Employee, B in (A:_.works_for/[1.subproject_of]*/->.)
where A:address = “Cologne” and B:duration >= 36

1 Recall that the notation “->” at the end of a regular path expression addresses
the last link of the path, whereas the notation “/->” addresses the destination
object. Further note that the “*” is used in “B:*.under_dev_by->.” to allow
arbitrary string key attribute values.

If we combine the solutions presented in the previous sections for conditions
on link keys and for iteration facilities in a straightforward manner to invert
the regular path expression “A:_.works_for/[1.subproject_of]*/->.”, we obtain:

select A:name, B:subject
from B in Project, H1 in (B:[_.has_subproject]*->.),
     A in (H1:/*.under_dev_by/->.)
where A:address = “Cologne” and B:duration >= 36 and H1:@no = 1

Unfortunately this query is not equivalent to the original query, because the
condition “H1:@no = 1” is checked only for the last link of each path matching
the regular path expression “B:[_.has_subproject]*->.”.

Therefore, we have to apply a different approach in situations where condi- 
tions for the key attribute values occur inside an iteration. To this end, we recall 
the basic aim of the query inversion. This aim is to apply an index structure for 
the destination object type of the original path expression. Let us assume for 
example that there is no index for the address attribute of the employees. Then 
in the above example query the base set definition “A in Employee” means that 
we have to scan all employees. On the other hand, there might be an index struc- 
ture for the duration attribute of the projects. If we assume that the condition 
“B:duration >= 36” is rather restrictive, there will be only few employees work- 
ing in such projects. Therefore it might be useful to consider not all employees, 
but only the employees working in such projects. The objects representing these 
employees can be determined by the following select statement: 

select distinct H2:.
from H1 in Project, H2 in (H1:[_.has_subproject]*/*.under_dev_by/->.)
where H1:duration >= 36

Two aspects have to be mentioned with respect to this query: (1) We use a select
distinct here to avoid duplicates in the result. (2) The query will in general return
more employees than actually needed, because the conditions for the link key
attributes given in the original query are not reflected here.

Now we simply use this query instead of the object type Employee to define 
base set A in the original query: 

select A:name, B:subject
from A in (select distinct H2:.
           from H1 in Project,
                H2 in (H1:[_.has_subproject]*/*.under_dev_by/->.)
           where H1:duration >= 36),
     B in (A:_.works_for/[1.subproject_of]*/->.)
where A:address = “Cologne” and B:duration >= 36

Using this query an index structure for the duration attribute of the projects 
can be employed to speed up query processing. 
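The effect of the sub-select can be illustrated with a small Python sketch: the sub-select yields a superset of the employees that are really needed, and the original path expression, including its key condition, is then evaluated only for these candidates. The data, the dictionary layout and the function names are assumptions for illustration. The sketch also assumes an acyclic project hierarchy and at most one outgoing subproject_of link with key 1 per project.

```python
# Hypothetical data: employee -> [(key, project)], project -> [(key, parent project)].
works_for = {"e1": [(1, "p2")], "e2": [(1, "p3")], "e3": [(1, "p4")]}
subproject_of = {"p2": [(1, "p1")], "p3": [(2, "p1")]}
duration = {"p1": 48, "p2": 12, "p3": 6, "p4": 10}
address = {"e1": "Cologne", "e2": "Cologne", "e3": "Cologne"}


def original(candidates):
    """A:_.works_for/[1.subproject_of]*/->. together with the where-clause."""
    pairs = []
    for a in candidates:
        if address[a] != "Cologne":
            continue
        for _key, project in works_for.get(a, []):
            b = project
            while True:  # zero or more subproject_of links with key 1
                if duration[b] >= 36:
                    pairs.append((a, b))
                parents = [p for k, p in subproject_of.get(b, []) if k == 1]
                if not parents:
                    break
                b = parents[0]
    return pairs


def candidates_via_subselect():
    """Employees reachable from long-running projects; key conditions are ignored."""
    found = set()
    for h1, months in duration.items():
        if months < 36:
            continue
        stack = [h1]
        while stack:  # the project itself and all of its subprojects
            proj = stack.pop()
            found.update(e for e, ws in works_for.items()
                         if any(p == proj for _k, p in ws))
            stack.extend(c for c, ps in subproject_of.items()
                         if any(p == proj for _k, p in ps))
    return found


candidates = candidates_via_subselect()
restricted = original(sorted(candidates))
unrestricted = original(sorted(works_for))
```

In this data the candidate set contains one employee who does not contribute to the result, which corresponds to remark (2) above, while the final result is the same as for a scan over all employees.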
3.5 Ambiguous Destination Object Types

Another problem caused by the inversion of regular path expressions is the
existence of ambiguous destination object types for some link types. In the schema
given in figure 2 the link type advisor_of is such a link type with two destination
object types (Student and Thesis). So when inverting a regular path expression
it may be necessary to address objects by using such a link type, although not all
destination object types should be considered. The following example uses the
link type has_advisor to define the dependent base set B:

select A:name, B:name
from A in Student, B in (A:.has_advisor/->.)
where B:address = “Cologne”

If we invert the regular path expression “A:.has_advisor/->.”, we have to use
the link type advisor_of, but only destination objects of type Student should be
addressed. To this end, we can use a type test predicate of P-OQL:

select A:name, B:name
from B in Employee, A in (B:_.advisor_of/->.)
where B:address = “Cologne” and A:. is of type Student

Similar situations can arise due to inheritance. To illustrate such a situation,
we extend our university schema given in figure 2 by two subtypes for the object
type Employee as shown in figure 3. As usual, in PCTE a subtype t1 of an object
type t0 inherits the applied attribute types and the allowed outgoing link types.
Furthermore, if t0 is defined as the destination object type of a link type, links
of this type can as well point to objects of type t1, because an object of type t1
can be used wherever an object of type t0 is required.



[Figure: the object type Employee with its two subtypes Professor and Assistent]

Fig. 3. Two subtypes of the object type Employee



To see the effects of inheritance for the inversion of regular path expres- 
sions, we reconsider the example query given in section 3.1 combined with the 
inheritance situation given in figure 3: 

select A:name, B:name
from A in Student, B in (A:_.attends/.held_by/->.)
where A:year = 4 and B:address = “Cologne”

Now the inverted query has to address the object type Employee and its
subtypes in the first base set. To this end, we can use the notation ObjectType*
which addresses an object type and its descendant object types:

select A:name, B:name
from B in Employee*, A in (B:_.holds/_.attended_by/->.)
where A:year = 4 and B:address = “Cologne”



3.6 Link Categories Used in Regular Path Expressions 

The last feature of P-OQL we want to consider here is the definition of the
allowed link categories in a regular path expression. This feature is extremely
useful when combined with the document retrieval facilities included in P-OQL
[9]. Unfortunately we cannot present these facilities here due to space
limitations. As a consequence, the following example may seem relatively artificial.

Assume the following query starting with projects and following reference 
links, which in this case lead to employees: 

select A:subject, B:name
from A in Project, B in (A:{r}/->.)
where B:address != “Cologne”

Since there is only one link type with category reference originating from the 
object type Project we can invert this query as follows: 

select A:subject, B:name
from B in Employee, A in (B:{@r}/->.)
where B:address != “Cologne” and A:. is of type Project

Here the regular path expression “B:{@r}/->.” enforces that the reverse link
of the traversed link has the category reference. Since this condition is true for
multiple outgoing link types of the object type Employee, we have to add the
condition “A:. is of type Project” in the where-clause of the query.

In general the situation is a bit more complicated, because there will be 
multiple destination object types for links with the required category. In this 
case the query optimizer has to look for a unifying supertype or it can build the 
union of the objects of the various types which can be accessed via sub-selects. 

4 The Optimizer Architecture 

In section 3 we have given various examples for situations where a query can be 
inverted to use another object type as the starting point for query processing. In 
the present section we will describe the architecture of our query optimizer which 
tries to exploit such inverted query formulations to speed up query processing. 

The query optimizer is implemented as a preprocessor. This preprocessor 
receives a query in P-OQL syntax and it returns a query in P-OQL syntax 
with some extensions specifying the index structures to be used. The different 
steps performed by our optimizer are shown in figure 4. The solid lines with 
the numbers (1) to (7) represent the main control flow whereas the dashed lines 
identified by the letters (a) to (e) represent the information transfer between 
those components of the optimizer which provide additional information for the 
optimization process. The main components can be described as follows: 




On the Optimization of Queries Containing Regular Path Expressions 






[Figure: components QueryInterface (Query Formulation, Query Transfer, Query
Evaluation, Result Presentation), Query Optimization (Input, QueryParser,
DecisionModule, Constructor, CostAccount, StatisticFile, MetaData for Database and
for Index), QueryEngine, and User; control flow (1) to (7), information
transfer (a) to (e)]

Fig. 4. The optimizer architecture



QueryInterface: The QueryInterface stores the P-OQL query in the Input file
(1) and expects the result for presentation to the user (7).

QueryParser: This component parses the original query from the Input (2) and
creates an internal representation which is passed to the DecisionModule (3).
Thereby the QueryParser checks whether there are base set definitions using
regular path expressions. If so, the corresponding base sets are marked.

DecisionModule: The DecisionModule analyses the base sets of the current
query to detect interdependent base sets. For each group of interdependent
base sets the definitions of the dependent base sets are scanned in order to
evaluate whether they can be inverted. When the definition of a dependent
base set includes, e.g., a complex sub-select or a navigation over a designation
link, an inversion is regarded as impossible. On the other hand, in the cases
explained in section 3 an inversion is possible. For each invertible base set
definition the destination object type is determined. These destination object
types and the object type addressed in the independent base set of the
original query formulation build a set of object types which can be used
as starting points of the query formulation. For these alternatives the query
execution costs are estimated based on the information provided by the
CostAccount module (b). For this cost estimation, information about the
cardinality of the base sets, information about the available index structures,
and information about the conditions in the where-clause of the query is
considered. The decision for the alternative with the lowest expected costs
is thereafter passed to the Constructor module (4).

Constructor: This module actually performs the inversion of the query. To this 
end, the Constructor depends on information from the MetaData module (c), 
e.g. to determine the name of a reverse link type. 

MetaData: This module provides an interface to retrieve the relevant meta 
data for the database and the index structures. 

StatisticFile: To assure a minimum of information even if there exists no in- 
dex structure for an object type, this file contains e.g. an estimation of the 
number of instances of each object type. 

CostAccount: The CostAccount module computes the relative costs of the 
original query plan and the potential alternative query plans by using the 
available information from the MetaData module (e). The details of this 
module are discussed in section 5. 

QueryEngine: This module evaluates the final query by means of a nested-loop
approach. For the base sets for which an index structure has been selected
the index structure is employed. The result of the evaluation is sent to the
QueryInterface (7), which presents it to the user.

5 Cost Estimation and Index Selection 

For the discussion of the applied cost estimation technique it is important to
mention that our implementation uses a multi-dimensional index structure. This
index structure is presented in detail in [11, 10]. For the present paper it suffices
to know that this index structure allows multiple attributes to be addressed in one
access structure (see footnote 2). A corresponding index definition might, e.g.,
look as follows:

Index for object type Student: 1. name, 2. year, 3. address 

Since the index structure is symmetric, the same selectivity is provided for 
all supported dimensions. Furthermore the index structure is well suited for 
situations where multiple attributes are addressed in the where-clause of a query, 
because in these cases the selectivity for all supported attributes is combined. 

In this context some administrative information is maintained for each index
structure: (1) an object counter containing the number of instances of the indexed
object type, (2) the number of dimensions of the index structure, (3) the indexed
attributes, and (4) an object type hierarchy specification, defining whether the
index is created only for objects of the defined object type itself, or for objects
of the defined type and all descendant types.

Based on this information, the existing index structures are determined for each
object type j which could be used as the starting point of the query formulation.
Now two cases have to be distinguished: (1) For none of the object
types an index exists: in this case we use the information from the StatisticFile
to select the object type with the fewest instances as the starting point of the
query formulation. (2) Otherwise we always choose to apply an index structure.
To select this index structure we proceed in two steps: in the first step we try
to determine the “best” index structure for each object type j for which index
structures exist. In the second step we select the object type which should be
used as the starting point of the query formulation.

2 It should be obvious that the use of this multi-dimensional index structure is by no
means a prerequisite for the application of our query optimization approach. If the
approach is used with one-dimensional index structures, such as the B-tree [5], only
the formulas presented in this section have to be slightly adapted.

1. step: This step is performed for each object type which could be the starting
point of the query and for which at least one index structure is defined with
an object type hierarchy specification corresponding to the query.

For the index structures of object type j we proceed as follows:

1. We collect the conditions from the where-clause which refer to object
type j. We denote the number of these conditions by $q_j$.

2. For each index i we count those conditions which can be supported by
the index. We denote this number by $h_{j,i}$.

3. For each index i of object type j we calculate $r_{j,i} = h_{j,i} / q_j$ (the
share of the conditions for object type j which are supported by index i).

4. Now we assume that the index structure with the highest value for $r_{j,i}$
is the “best” index structure for object type j. If there are multiple
index structures with this highest value, we choose the index which has
the fewest dimensions. If this is not unique either, we choose an arbitrary
index out of these index structures. Let $i^*$ denote the chosen index
structure for object type j.

2. step: Now we can choose the object type j which should be used as the
starting point of the query formulation: For each object type j we calculate
the value $C_j = \text{object counter} \times \max(0.01,\ 1 - r_{j,i^*})$, where
object counter is taken from the administrative information of the index
structure $i^*$. Since small values for object counter and high values for
$r_{j,i^*}$, which always is a value out of [0,1], are desirable, we use
$1 - r_{j,i^*}$ in the formula. We use $\max(0.01,\ 1 - r_{j,i^*})$ because
otherwise this factor would be zero for $r_{j,i^*} = 1$, i.e. when all conditions
are supported by the index structure. So to assure that the object counter can
always influence the formula we use $\max(0.01,\ 1 - r_{j,i^*})$. Finally the
DecisionModule chooses the object type j with the lowest value for $C_j$ as the
starting point of the query formulation.
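The two steps can be summarized in a few lines of Python. This is a minimal sketch of the selection rule as described above, not the implementation of the DecisionModule. The class `Index`, the function names and the sample index definitions are assumptions; in particular, the attribute lists and the conditions are illustrative, and a share of 0 is assumed when no condition refers to an object type, a case the text does not cover.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Index:
    number: int
    object_counter: int
    attributes: tuple  # indexed attributes; their count is the number of dimensions


def best_index(indexes, conditions):
    """Step 1: highest share of supported conditions, ties broken by fewest dimensions."""
    q = len(conditions)

    def share(ix):
        if q == 0:  # assumption: no condition for this object type means share 0
            return 0.0
        return sum(1 for c in conditions if c in ix.attributes) / q

    best = max(indexes, key=lambda ix: (share(ix), -len(ix.attributes)))
    return best, share(best)


def choose_start(indexes_by_type, conditions_by_type):
    """Step 2: choose the object type j with the lowest value C_j."""
    costs = {}
    for j, indexes in indexes_by_type.items():
        ix, r = best_index(indexes, conditions_by_type.get(j, []))
        costs[j] = (ix.object_counter * max(0.01, 1 - r), ix.number)
    start = min(costs, key=lambda j: costs[j][0])
    return start, costs[start][1], costs


# Illustrative index definitions (hypothetical attribute lists).
indexes_by_type = {
    "Student": [Index(1, 6040, ("name", "year")),
                Index(2, 6040, ("name", "address", "year")),
                Index(3, 6040, ("name", "address"))],
    "Employee": [Index(1, 1995, ("name", "address")),
                 Index(2, 1995, ("name", "address", "office")),
                 Index(3, 1995, ("office",))],
}
# Attributes addressed by the where-clause, grouped by object type.
conditions_by_type = {"Student": ["year"], "Employee": ["address"]}

start, index_number, costs = choose_start(indexes_by_type, conditions_by_type)
```

With these inputs both object types have an index supporting their single condition, so the factor is 0.01 in both cases and the smaller object counter decides in favour of Employee with index 1.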



6 Experimental Results 

In this section we present some experimental results achieved with our optimizer. 
The tests were performed on a Sun Sparc20 with two processors and 96 MB of 
main memory. The database for the tests was filled with synthetic data created 
for the university schema given in figure 2 and extended in figure 3. Table 1 sum- 
marizes the existing index structures for the object types Student and Employee 
which occur in our example queries. 

First we examined the example query explained in section 3.1:

Table 1. Index structures defined for the tests

object type  object   index   number of   attributes             object type
             counter  number  dimensions                         hierarchy
Student      6040     1       2           name, term             this type only
Student      6040     2       3           name, address, term    this type only
Student      6040     3       2           name, address          this type only
Employee*    1995     1       2           name, address          with subtypes
Employee*    1995     2       3           name, address, office  with subtypes
Employee*    1995     3       1           office                 with subtypes

select A:name, B:name
from A in Student, B in (A:_.attends/.held_by/->.)
where A:year = 4 and B:address = “Cologne”

For this query the optimizer returned the following query, where the base set
definition “B in #1# Employee*” specifies that index 1 has to be used to speed
up the loop over the employees:

select A:name, B:name
from B in #1# Employee*, A in (B:_.holds/_.attended_by/->.)
where A:year = 4 and B:address = “Cologne”



Table 2. Performance results for the first test query

using the index structures for the object type Student
index number  number of read index pages  relative execution time
1             225                         72 %
2             192                         66 %
3             188                         66 %

using the index structures for the object type Employee*
index number  number of read index pages  relative execution time
1             47                          23 %
2             51                          26 %
3             115                         100 %



The performance results presented in table 2 show that the optimizer in 
fact selected the best possible query plan for our example query. Without the 
inversion of the query the execution time would have been nearly three times 
higher. So the optimizer selected the correct object type to start with and it 
selected the best possible index structure for this object type. 
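As a quick check of the “nearly three times” figure, the relative execution times of Table 2 can be compared directly. This assumes that the plan without inversion would use the best index structure for Student (index 2 or 3, 66 %), while the inverted plan uses index 1 for Employee (23 %):

```latex
\[
\frac{t_{\text{Student, best index}}}{t_{\text{Employee, index 1}}}
  = \frac{66\,\%}{23\,\%} \approx 2.9
\]
```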
In the second test query we are searching for the subjects of all theses advised
by an employee named “Brubeck”:

select A:subject
from A in Thesis, B in (A:.has_advisor/->.)
where B:name = “Brubeck”

In this case the optimizer returned the following query: 
select A:subject
from B in #1# Employee*, A in (B:_.advisor_of/->.)
where B:name = “Brubeck” and A:. is of type Thesis

Here the condition “A:. is of type Thesis” is inserted into the where-clause
because of the ambiguous destination object types of the link type advisor_of.



Table 3. Performance results for the second test query

without using index structures for the object type Thesis
index number  number of read index pages  relative execution time
-             -                           100 %

using the index structures for the object type Employee*
index number  number of read index pages  relative execution time
1             5                           6 %
2             6                           8 %
3             115                         75 %



Table 3 shows the performance results for the second test query. Since there 
is no index structure defined for the object type Thesis we have included the 
time for the query execution without an index structure in the table. Again the 
optimizer chose the best possible query plan for our example query. 

It remains to mention that we have performed various other tests as well. 
The optimizer selected the best possible plan in over 80 % of the cases, and 
whenever the selected plan was not the best one, it was only slightly worse. 

7 Related Work 

Although there is a great number of approaches for query optimization problems
in the field of ooDBMS, to the best of our knowledge this is the first approach
which tries to invert complex regular path expressions in order to exploit index
structures defined for the destination object types of the paths.

For a general overview of query optimization problems in ooDBMS we refer
to [13]. In [15] Mitschang presents the basic concepts and implementation aspects
of query processing and query optimization. Furthermore, multiple interesting
articles dealing with various aspects of query optimization can be found in [7].
Finally, some rule-based approaches to query optimization are presented in [6],
[14] and [1]. However, none of these papers addresses topics related to the
potential inversion of regular path expressions.

On the other hand some work dealing with path expressions and related 
aspects has been published, even though it is not directly concerned with the 
inversion of the path expressions. 

In [4] Christophides, Cluet and Moerkotte extend an object algebra with 
two new operators and present some interesting rewriting techniques for queries 
featuring generalized path expressions. However, their approach does not ad- 
dress the problem of determining applicable access structures or aspects of the 
inversion of regular path expressions. 

Ozkan, Dogac, and Evrendilek present a heuristic to determine the optimum 
execution order for the joins needed to process a path expression [16] . In contrast 
to our approach they assume that the references to objects are maintained as 
implicit joins which are converted to explicit joins during the optimization phase. 
Since we assume that the navigation via a link in PCTE is an extremely efficient 
operation supported by the physical structures of the OMS, these considerations 
are not applicable in our context. 

Other approaches deal with the exploitation of materialized views for query 
processing [3] or with a general architecture for a query optimizer for OQL [12]. 

In summary, none of the aforementioned approaches addresses the fact 
that regular path expressions can be inverted in order to exploit index structures 
defined for the destination object types of the paths. 

8 Conclusion and Future Work 

In this paper we have presented an approach which exploits the potential invert- 
ibility of regular path expressions in order to apply index structures defined for 
the destination object types of the original path expressions. The approach has 
been described for the environment of PCTE and P-OQL, but it can be easily 
adapted for ooDBMS with simpler link concepts than PCTE. 

Our future work will be concerned with an extension of the presented 
approach to other facilities of the query language. One important topic in this 
respect is sub-selects, which can often be eliminated or flattened and processed 
in the same way as regular path expressions. Another topic is quantifiers; 
existential quantifiers in particular seem to be well suited for integration into our 
approach. Improved cost estimation techniques are another interesting research 
area, and finally a formal framework for our optimization approach is needed. 

Abstract. Recently multidimensional arrays have received considerable 
attention among the database community, with applications ranging from GIS to 
OLAP. Work on the formalization of arrays frequently focuses on mapping 
sparse arrays to ROLAP schemata. Database modeling of further array types, 
such as image data, is done differently and with less rigid methods. A unifying 
formal framework for general array handling of image, sensor, statistics, and 
OLAP data is missing. 

We present a cross-dimensional and application-independent algebra for the 
high-level treatment of arbitrary arrays. An array constructor, a generalized 
aggregate, plus a multidimensional sorter allow arrays to be manipulated 
declaratively. This algebra forms the conceptual basis of a domain-independent array 
DBMS, RasDaMan, which offers an SQL-based query language with extensive 
algebraic query and storage optimization. The system is in practical use in 
neuroscience. 

We introduce the algebra and show how the operators transform to the array 
query language. The universality of our approach is demonstrated by a number 
of examples from imaging, statistics, and OLAP. 



1 Introduction 

In principle, any natural phenomenon becomes spatio-temporal array data of some 
specific dimensionality once it is sampled and quantised for storage and manipulation 
in a computer system; additionally, a variety of artificial sources such as simulators, 
image renderers, and data warehouse population tools generate array data. The 
common characteristic they all share is that a large set of large multidimensional 
arrays has to be maintained. We call such arrays multidimensional discrete data 
(MDD), expressing the variety of dimensions and separating them from the concep- 
tually different multidimensional vectorial data appearing in geo databases. 

As arrays obviously form both an important and a very clearly defined information 
category, it seems natural to describe them in a uniform manner through a 
homogeneous conceptual model. Preferably this is done in a way that the array model 
smoothly fits into existing overall models. 

From a database perspective (and history), several separate information categories 
can be distinguished. Sets comprise the first category, well addressed by relational 
algebra and calculus. Semantic nets form the second one, being fundamentally 
different in structures and operations, although mappings to the relational model have 
been studied extensively. The third fundamental category is text, addressed by 
information retrieval (IR) technology. This distinction holds notwithstanding the fact that 
techniques to map object nets to relations with foreign keys are well-known, that IR 
techniques have found their way into relational products (e.g., Oracle8), and that 
hypertext combines nets and text into so-called semi-structured data. Arrays represent 
a separate fourth category, substantially different from the previous three. Again, 
mappings to the relational model have been developed; they involve, however, a 
significant semantic transformation. For sparse business data these are star, galaxy, and 
snowflake techniques [1], for image data these are blobs [2] (where a particularly high 
loss of semantics is incurred). A clear indicator for the semantic mismatch of 
SQL-based multidimensional queries is the resulting lack of functionality and performance, 
leading to several suggestions for extending the relational model - e.g., [3, 4, 5]. 

Multidimensional database research has a long history, as statistical databases have 
long been studied [6, 7]; more recently, OLAP continues this tradition with a strong 
focus on business data [8]. Several proposals exist to formalize array structures and 
operations for OLAP [4, 3, 9, 10, 11, 5], for scientific computing [12] and for imaging 
[13]. The Discrete Fourier Transform (DFT) has been studied from a database 
viewpoint [14]. Often, however, formal concepts have not been implemented in an 
operational system and they have not been evaluated in real-life applications. 
Moreover, many of the formal models have been designed specifically with the goal of 
mapping arrays to relation tuples and in a way that, in practice, makes sense only for 
sparse arrays. 

In this paper, we propose an algebraic framework (see [15] for a first version) 
which allows expressing cross-dimensional queries, i.e., operations on arrays of any 
number of dimensions, simultaneously in one and the same expression and symmetric 
in all dimensions. Essentially, this algebra consists of only three operations: an array 
constructor, a generalized aggregation, and a multidimensional sorter. This core model 
does not rely on recursion and is safe in evaluation, yet it is sufficient to express a 
wide range of imaging, statistical, and OLAP operations. Therefore, our algebra can 
be seen as a “universal” framework, independent from the particular application 
domain. The concepts are implemented in the domain-independent array DBMS 
RasDaMan 1 [16, 15, 17], hence the name RasDaMan Array Algebra, or short: Array 
Algebra. 

The remainder of this paper is organized as follows. In Section 2, Array Algebra is 
presented, together with practical examples from diverse application fields to illustrate 
its applicability. The step to an SQL-embedded array query language, RasQL, is 
shown in Section 3. Section 4 surveys related work, and Section 5 summarizes our 
findings. 



1 Raster Data Manager: see www.forwiss.de/~rasdaman 

2 Array Algebra 

We treat arrays as functions mapping n-dimensional points (i.e., vectors) from 
discrete Euclidean space to values. This has long been common in imaging - see, 
e.g., [18] - and has been transposed to database terminology in [16, 15]. To smoothly 
embed Array Algebra into existing overall algebras we use a set-oriented basis. Due to 
space constraints we have to omit most proofs here. 

Operations on such arrays frequently apply a function simultaneously to a set of 
cells, requiring second-order functionals in the algebra. In practice they are necessary 
to allow for binding variables to points for iterating coordinate sets and also to 
aggregate arrays (or part thereof) into scalar values. The latter operation corresponds 
very much to relational set aggregators; however, instead of providing a limited list of 
aggregation operations as in the relational algebra, a general constructor is introduced 
by Array Algebra which is parametrized with the underlying base operation. 



2.1 N-Dimensional Interval Arithmetics 

We first introduce some notation for n-dimensional integer interval arithmetics. We 
call the coordinate set of an array its spatial domain. Informally, a spatial domain is 
defined as a set of n-dimensional points (i.e., algebraic vectors) in Euclidean space 
forming a finite hypercube with boundaries parallel to the coordinate system axes. 

We assume common vector notation. For a natural number $d>0$, we write 
$x = (x_1, \ldots, x_d) \in X \subseteq \mathbb{Z}^d$ for some d-dimensional vector x, x+y for vector addition, 
etc. The point set forming the geometric extent of an array is called its spatial domain. 
A spatial domain X of dimension d spanned by l and h is defined as 

$$X = [l_1:h_1, \ldots, l_d:h_d] := \begin{cases} \mathop{\times}\limits_{i=1}^{d} \{ x_i : l_i \le x_i \le h_i \} & \text{if } \forall\, 1 \le i \le d : l_i \le h_i \\ \{\,\} & \text{otherwise.} \end{cases}$$

Functions lo, hi: $\mathcal{P}(\mathbb{Z}^d) \to \mathbb{Z}^d$ (where $\mathcal{P}$ is the powerset) defined as $\mathrm{lo}(X) = l$ 
and $\mathrm{hi}(X) = h$ for some spatial domain X given as before denote the bounding 
vectors. We will abbreviate $\mathrm{lo}_i(X) = l_i$ and $\mathrm{hi}_i(X) = h_i$ for the i-th component. 
Function $\dim(X) = d$ denotes the dimension of spatial domain X. 

On such hypercubes, point set operations can be defined in a straightforward way. 
We admit only those operations which respect closure, such as intersect and 
union*, whereby the asterisk denotes the hull operation applied to the result: 

$$\mathrm{intersect}^*(X, Y) := [\, \max(\mathrm{lo}_1(X), \mathrm{lo}_1(Y)) : \min(\mathrm{hi}_1(X), \mathrm{hi}_1(Y)),\ \ldots,\ \max(\mathrm{lo}_d(X), \mathrm{lo}_d(Y)) : \min(\mathrm{hi}_d(X), \mathrm{hi}_d(Y)) \,]$$

$$\mathrm{union}^*(X, Y) := [\, \min(\mathrm{lo}_1(X), \mathrm{lo}_1(Y)) : \max(\mathrm{hi}_1(X), \mathrm{hi}_1(Y)),\ \ldots,\ \min(\mathrm{lo}_d(X), \mathrm{lo}_d(Y)) : \max(\mathrm{hi}_d(X), \mathrm{hi}_d(Y)) \,]$$

Obviously these operations are commutative, associative, and distributive. 
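
As an illustration, the two hull operations can be sketched in a few lines of Python. The representation of a spatial domain as a list of inclusive (lo, hi) pairs and the function names are assumptions made for this sketch only; they are not part of Array Algebra or RasDaMan.

```python
from typing import List, Tuple

# Assumed representation: one inclusive (lo, hi) pair per dimension.
Domain = List[Tuple[int, int]]


def intersect(x: Domain, y: Domain) -> Domain:
    """Component-wise intersection; [] stands for the empty domain."""
    result = [(max(xl, yl), min(xh, yh)) for (xl, xh), (yl, yh) in zip(x, y)]
    return result if all(lo <= hi for lo, hi in result) else []


def union_hull(x: Domain, y: Domain) -> Domain:
    """Smallest hypercube containing both domains (union* in the text)."""
    return [(min(xl, yl), max(xh, yh)) for (xl, xh), (yl, yh) in zip(x, y)]


X = [(0, 9), (0, 4)]
Y = [(5, 14), (2, 7)]
print(intersect(X, Y))   # [(5, 9), (2, 4)]
print(union_hull(X, Y))  # [(0, 14), (0, 7)]
```

Both results are again hypercubes, which is the closure property the text requires.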

The shift operator changes a spatial domain’s position according to a 
translation vector t: 

$$\mathrm{shift}_t(X) := \{\, x + t : x \in X \,\}$$

Let X be spanned by d-dimensional vectors l and h. For some integer i with 
$1 \le i \le d$ and a one-D integer interval $I = [m:n]$ with $l_i \le m \le n \le h_i$, the trim of X to I in 
dimension i is defined as 

$$\tau_{i,I}(X) := \{\, x \in X : m \le x_i \le n \,\} = [\, l_1:h_1, \ldots, m:n, \ldots, l_d:h_d \,]$$

Intuitively speaking, trimming slices off those parts of an array which are lower 
than m and higher than n in the dimension indicated; the dimension is unchanged. As 
opposed to this, a section cuts out a hyperplane with dimension reduced by 1. 
Formally, for some X as above and an integer p with $l_i \le p \le h_i$, the section of X at 
position p in dimension i is given by 

$$\sigma_{i,p}(X) := \{\, x \in \mathbb{Z}^{d-1} : x = (x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_d),\ (x_1, \ldots, x_{i-1}, p, x_{i+1}, \ldots, x_d) \in X \,\} = [\, l_1:h_1, \ldots, l_{i-1}:h_{i-1}, l_{i+1}:h_{i+1}, \ldots, l_d:h_d \,]$$



Trimming is commutative and associative, whereas a section changes dimension 
numbering and, therefore, has neither of these properties. 
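
The three domain operations can be illustrated with a small Python sketch. As before, the list-of-pairs representation, the 1-based dimension argument, and the function names are assumptions of this sketch rather than part of the formalism.

```python
from typing import List, Tuple

Domain = List[Tuple[int, int]]  # assumed: inclusive (lo, hi) per dimension


def shift(x: Domain, t: List[int]) -> Domain:
    """Translate every dimension range by the matching component of t."""
    return [(lo + ti, hi + ti) for (lo, hi), ti in zip(x, t)]


def trim(x: Domain, i: int, m: int, n: int) -> Domain:
    """Restrict dimension i (1-based) to [m:n]; the dimension count is kept."""
    lo, hi = x[i - 1]
    if not (lo <= m <= n <= hi):
        raise ValueError("interval must lie inside the domain")
    return x[:i - 1] + [(m, n)] + x[i:]


def section(x: Domain, i: int, p: int) -> Domain:
    """Cut at position p in dimension i (1-based); one dimension is dropped."""
    lo, hi = x[i - 1]
    if not (lo <= p <= hi):
        raise ValueError("position must lie inside the domain")
    return x[:i - 1] + x[i:]


X = [(0, 1023), (0, 767), (1, 10)]
print(trim(X, 2, 100, 200))  # [(0, 1023), (100, 200), (1, 10)]
print(section(X, 2, 5))      # [(0, 1023), (1, 10)]
```

After a section the former third dimension becomes the second one, which is why sections, unlike trims, cannot be reordered freely.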



2.2 The Core Algebra 

Let $X \subseteq \mathbb{Z}^d$ be a spatial domain and F a homogeneous algebra. Then, an F-valued 
d-dimensional array over spatial domain X - or short: (multidimensional) array - is 
defined as 

$$a: X \to F \ (\text{i.e., } a \in F^X), \qquad a = \{\, (x, a(x)) : x \in X,\ a(x) \in F \,\}$$

Array elements a(x) are referred to as cells. For notational convenience, we also 
allow enumerating the components of a cell coordinate vector, e.g., $a(x_1, x_2, x_3)$. 
Auxiliary function sdom(a) denotes the spatial domain of some array a; further, we 
lift function dim to arrays. For an array $a: X \to F$, sdom and dim are defined as 

$$\mathrm{sdom}(a) := X$$

$$\dim(a) := \dim(\mathrm{sdom}(a))$$

We denote the i-th dimension range of an array’s spatial domain by $\mathrm{sdom}_i(a)$. 
Example: For a 1024x768 image a with lower left corner in the origin of the 
coordinate system, sdom(a)=[0:1023,0:767], dim(a)=2. 

The first functional we introduce is the array constructor MARRAY. It defines 
arrays by indicating a spatial domain and an expression which is evaluated for 
each cell position of the spatial domain. An iteration variable bound to a spatial 
domain is available in the cell expression so that a cell’s value can depend on its 
position. Let X be a spatial domain, F a value set, and v a free identifier. Let further $e_v$ be 
an expression with result type F containing zero or more free occurrences of v as 
placeholder(s) for an expression with result type X. Then, an array over spatial 
domain X with base type F is constructed through 

$$\mathrm{MARRAY}_{X,v}(e_v) = \{\, (x, a(x)) : a(x) = e_x,\ x \in X \,\}$$

Example: Consider scaling of a greyscale image a with sdom(a) = [1:m,1:n] by a 
factor $s \in \mathbb{R}$. We assume componentwise scalar division and rounding on vectors and 
write 

$$\mathrm{MARRAY}_{[1:m*s,\,1:n*s],\,v}(\, a(\mathrm{round}(v/s)) \,)$$

For 0<s<1 the image is sized down; the interpolation method then corresponds to 
“nearest neighbor”, the simplest interpolation technique used in imaging. 
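
The constructor and the scaling example can be mimicked in Python as follows. This is a minimal sketch under stated assumptions: arrays are modelled as dictionaries from coordinate tuples to values, domains as lists of inclusive (lo, hi) pairs, and the helper names points and marray are chosen for illustration only. Python's built-in round is used for the rounding step.

```python
from itertools import product


def points(domain):
    """Enumerate all integer points of a domain given as [(lo, hi), ...]."""
    return product(*(range(lo, hi + 1) for lo, hi in domain))


def marray(domain, cell_expr):
    """MARRAY: evaluate cell_expr once for every point of the domain."""
    return {x: cell_expr(x) for x in points(domain)}


# A 4x2 greyscale image a with sdom(a) = [1:4, 1:2]; the cell value encodes
# its own position so that the effect of scaling is easy to see.
m, n = 4, 2
a = marray([(1, m), (1, n)], lambda v: 10 * v[0] + v[1])

# Scale by s = 0.5: the result domain is [1:m*s, 1:n*s] = [1:2, 1:1].
s = 0.5
scaled = marray([(1, int(m * s)), (1, int(n * s))],
                lambda v: a[tuple(round(c / s) for c in v)])
print(scaled)  # {(1, 1): 22, (2, 1): 42}
```

Each result cell simply looks up one source cell, which is the nearest-neighbor behaviour described above.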

The operation which in some sense is the dual to the MARRAY constructor is the 
condenser COND. It takes the values of an array’s cells and combines them through the 
operation provided, thereby obtaining a scalar value. Again, an iterator variable is 
bound to a spatial domain to address cell values in the condensing expression. Let $\circ$ be 
a commutative and associative operation with signature $\circ: F \times F \to F$, let further v be a 
free identifier, $X = \{\, x_1, \ldots, x_n \mid x_i \in \mathbb{Z}^d \,\}$ a spatial domain consisting of n points, and 
$e_{a,v}$ an expression of result type F containing occurrences of an array a and 
identifier v. Then, the condense of a by $\circ$ is defined as 

$$\mathrm{COND}_{\circ,X,v}(e_{a,v}) := \mathop{\bigcirc}\limits_{x \in X} e_{a,x} = e_{a,x_1} \circ \ldots \circ e_{a,x_n}$$

Examples: Let a be the image as defined in the above example. Average pixel 
intensity is given by 

$$\mathrm{COND}_{+,\,\mathrm{sdom}(a),\,v}(a(v)) \,/\, |\mathrm{sdom}(a)| = \sum_{x \in [1:m,1:n]} a[x] \,/\, (m*n)$$

For color table computation needed, e.g., for generation of a GIF image encoding, 
one has to know the set of all values occurring in array a. The condenser allows 
deriving this set by performing the union of all cell values: 

$$\mathrm{COND}_{\cup,\,\mathrm{sdom}(a),\,v}(\{\, a(v) \,\})$$
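
Both condenser examples can be reproduced with a fold over the points of the spatial domain. The sketch below assumes the same illustrative modelling as before (arrays as dictionaries keyed by coordinate tuples; the names points and cond are hypothetical) and relies on functools.reduce from the Python standard library.

```python
from functools import reduce
from itertools import product
import operator


def points(domain):
    """Enumerate all integer points of a domain given as [(lo, hi), ...]."""
    return product(*(range(lo, hi + 1) for lo, hi in domain))


def cond(op, domain, cell_expr):
    """COND: combine cell_expr(x) for all x in the domain with op.
    op is expected to be commutative and associative, as in the text."""
    return reduce(op, (cell_expr(x) for x in points(domain)))


sdom_a = [(1, 3), (1, 2)]
a = {(i, j): (i * j) % 4 for i, j in points(sdom_a)}

total = cond(operator.add, sdom_a, lambda v: a[v])
average = total / len(a)                              # average intensity
colors = cond(operator.or_, sdom_a, lambda v: {a[v]})  # set union of values
print(total, colors)  # 10 {0, 1, 2, 3}
```

Because the operation is commutative and associative, the order in which points are visited does not affect the result.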



The third and last operator is an array sorter which proceeds along a selected 
dimension to reorder the corresponding hyperslices. Function sort rearranges a 
given array along a specified dimension s without changing its value set or spatial 
domain. To this end, an order-generating function is provided which associates a 
“sequence position” to each (d-1)-dimensional hyperslice. Let a be a d-dimensional 
array, $s \in \mathbb{N}$ with $1 \le s \le d$ a dimension number, and $f_{s,a}: \mathrm{sdom}_s(a) \to \mathbb{N}$ a total 
function which, for a given array a, inspects a in the sorting dimension s and delivers 
an ordering measure for each hyperslice. Further, let perm(x,y) be a predicate 
indicating that vector x is a permutation of vector y (and vice versa). Then, the two 
sorters $\mathrm{sort}^{asc}_{s,f}$ and $\mathrm{sort}^{desc}_{s,f}$ for ascending and descending order, resp., are 
given as those arrays which consist of permutations of the hyperslices in the sort 
dimension and, additionally, fulfil the sorting criterion given by f: 

$$\begin{aligned}
\mathrm{sort}^{asc}_{s,f}(a) := \{\, & (y, b(y)) : y \in \mathrm{sdom}_s(a), \\
& \forall p, q \in \mathrm{sdom}_s(a) : p < q \Rightarrow f_{s,b}(p) \le f_{s,b}(q), \\
& \mathrm{perm}(\, ( b(x_1, \ldots, x_{s-1}, \mathrm{sdom}_s(a).lo, x_{s+1}, \ldots, x_d), \ldots, b(x_1, \ldots, x_{s-1}, \mathrm{sdom}_s(a).hi, x_{s+1}, \ldots, x_d) ), \\
& \qquad\ \ ( a(x_1, \ldots, x_{s-1}, \mathrm{sdom}_s(a).lo, x_{s+1}, \ldots, x_d), \ldots, a(x_1, \ldots, x_{s-1}, \mathrm{sdom}_s(a).hi, x_{s+1}, \ldots, x_d) ) \,) \,\}
\end{aligned}$$

$$\begin{aligned}
\mathrm{sort}^{desc}_{s,f}(a) := \{\, & (y, b(y)) : y \in \mathrm{sdom}_s(a), \\
& \forall p, q \in \mathrm{sdom}_s(a) : p < q \Rightarrow f_{s,b}(p) \ge f_{s,b}(q), \\
& \mathrm{perm}(\, ( b(x_1, \ldots, x_{s-1}, \mathrm{sdom}_s(a).lo, x_{s+1}, \ldots, x_d), \ldots, b(x_1, \ldots, x_{s-1}, \mathrm{sdom}_s(a).hi, x_{s+1}, \ldots, x_d) ), \\
& \qquad\ \ ( a(x_1, \ldots, x_{s-1}, \mathrm{sdom}_s(a).lo, x_{s+1}, \ldots, x_d), \ldots, a(x_1, \ldots, x_{s-1}, \mathrm{sdom}_s(a).hi, x_{s+1}, \ldots, x_d) ) \,) \,\}
\end{aligned}$$



The resulting array has the same number of dimensions, spatial domain, and base 
type as the input array. Note that function $f_{s,a}$ has all degrees of freedom to assess 
any of a's cell values for determining the measure value of a hyperslice on hand - it 
can be a particular cell value in the current hyperslice, the average of all hyperslice 
values, or even neighbored slices (e.g., for relative increases of sales values). 

Example: Let a be a 1-D array with spatial domain D=[1:d] where cell values 
denote sales figures over time. Let further sorting function $f_{1,a}$ be given as 
$f_{1,a}(p) = a[p]$. Then, $\mathrm{sort}^{desc}_{1,f}(a)$ delivers the ranked sales. 
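
The ranked-sales example can be sketched in Python. The function below is one possible reading of the sorter, written under the assumptions used for illustration throughout (arrays as dictionaries keyed by coordinate tuples, 1-based dimension numbers, a measure function taking the array and a slice position); the name sort_array is hypothetical.

```python
def sort_array(a, domain, s, f, descending=False):
    """Reorder the hyperslices of a along dimension s (1-based) by measure f.

    domain is [(lo, hi), ...]; f(a, p) returns the ordering measure of the
    hyperslice at position p. The spatial domain of the result equals that
    of a; only the hyperslices change places.
    """
    lo, hi = domain[s - 1]
    positions = list(range(lo, hi + 1))
    ranked = sorted(positions, key=lambda p: f(a, p), reverse=descending)
    source = dict(zip(positions, ranked))  # target position -> source position
    return {x: a[x[:s - 1] + (source[x[s - 1]],) + x[s:]] for x in a}


# 1-D sales figures over time, with f(p) = a[p] as in the example above.
sales = {(1,): 30, (2,): 50, (3,): 10, (4,): 40}
ranked_sales = sort_array(sales, [(1, 4)], 1, lambda a, p: a[(p,)],
                          descending=True)
print(ranked_sales)  # {(1,): 50, (2,): 40, (3,): 30, (4,): 10}
```

For arrays of higher dimension the same function moves whole hyperslices, since all cells sharing a position in dimension s are relocated together.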

As an aside we note that the sort operator includes the relational group by. Below 
we will demonstrate that slice and roll-up operations arising from array access based 
on dimension hierarchies can be expressed, although - not very comfortably - by indi- 
cating the cell coordinates pertaining to a particular member set. Concepts for an in- 
tuitive, symbolic treatment of dimension hierarchies are currently under investigation. 



2.3 Derived Operators 

Several useful operations can be derived from the above ones. We present a selec- 
tion of those which have turned out particularly important in practical applications. 

2.3.1 Trimming and Section 

The previously introduced spatial domain operations trimming and section give rise 
to corresponding array operations. For some array a, a 1-D interval I, and two 
natural numbers $1 \le t \le \dim(a)$ and $p \in \mathrm{sdom}_t(a)$ they are defined as 

$$\mathrm{TRIM}_{t,I}(a) := \mathrm{MARRAY}_{X,v}(a(v)) \quad \text{for } X = \tau_{t,I}(\mathrm{sdom}(a)) \text{ and } d \le \dim(a)$$

$$\mathrm{SECT}_{t,p}(a) := \mathrm{MARRAY}_{X,v}(a(v)) \quad \text{for } X = \sigma_{t,p}(\mathrm{sdom}(a)) \text{ and } d \le \dim(a)$$

Example: Slicing of an OLAP cube c with spatial domain sdom(c)=D×R×P to 
extract subcube $D' \times R' \times P' \subseteq \mathrm{sdom}(c)$ is denoted as 

$$\mathrm{TRIM}_{1,D'}(\, \mathrm{TRIM}_{2,R'}(\, \mathrm{TRIM}_{3,P'}(c) \,) \,)$$

2.3.2 Induced Operations 

A basic set of operations is induced by the algebra of the underlying value sets. If 
$a, b \in F^X$ are arrays and $\circ$ is a binary operation on F, then $\circ$ induces a binary operation 
on $F^X$ denoted by $\circ_{ind}$ such that, if $c = a \circ_{ind} b$, then $c \in F^X$ and, for all $x \in X$, 
$c(x) = a(x) \circ b(x)$. Along this line, we also allow inducing unary operations. Notably, 
these operations are not axiomatic; for a unary function f and a binary function g, 

$$\mathrm{IND}_f(a) = \mathrm{MARRAY}_{X,v}(\, f(a(v)) \,) \quad \text{for } X = \mathrm{sdom}(a)$$

$$\mathrm{IND}_g(a, b) = \mathrm{MARRAY}_{X,v}(\, g(a(v), b(v)) \,) \quad \text{for } X = \mathrm{sdom}(a) = \mathrm{sdom}(b)$$

Algebraic properties of F transform to corresponding structures on the set $F^X$ of 
induced functions. If F is a field, then $F^X$ is a vector space; for a ring F, $F^X$ is a module 
for suitably defined spatial domains. 

Examples: Let a be a grayscale image over spatial domain X. Increasing intensity 
by 5 can be accomplished through induction on unary "+5": 

$$\mathrm{IND}_{+5}(a) = \{\, (x, b(x)) : b(x) = a(x) + 5,\ x \in X \,\}$$

Consider now another grayscale image b over the same spatial domain X. Then, 
pixel addition can be induced to obtain image addition: 

$$\mathrm{IND}_{+}(a, b) = \{\, (x, c(x)) : c(x) = a(x) + b(x),\ x \in X \,\}$$

When used in a query, binary induction obviously implies a spatial join. 

Let now c be a color image where the cell type is a three-integer record of red, 
green, and blue intensity, resp. Such a pixel-interleaved image is transformed into a 
channel-interleaved representation, i.e., three separate color planes, through induction 
on the record access operator obtaining 

< c.red, c.green, c.blue > 

The above type of induction is also referred to as pointwise induction, as points 
pairwise match for each application of the base function. 
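
The three induction examples translate directly into a short Python sketch. It assumes arrays modelled as dictionaries from coordinate tuples to cell values and color pixels modelled as small dictionaries; the function names induce_unary and induce_binary are hypothetical.

```python
def induce_unary(f, a):
    """IND_f: apply f to every cell; the spatial domain is unchanged."""
    return {x: f(v) for x, v in a.items()}


def induce_binary(g, a, b):
    """IND_g: combine matching cells of two arrays over the same domain."""
    if a.keys() != b.keys():
        raise ValueError("arrays must share the same spatial domain")
    return {x: g(a[x], b[x]) for x in a}


a = {(0, 0): 10, (0, 1): 20}
b = {(0, 0): 1, (0, 1): 2}
brighter = induce_unary(lambda v: v + 5, a)          # induction on "+5"
added = induce_binary(lambda p, q: p + q, a, b)      # image addition

# Pixel-interleaved color image split into three color planes.
c = {(0, 0): {"red": 1, "green": 2, "blue": 3}}
planes = tuple(induce_unary(lambda px, ch=ch: px[ch], c)
               for ch in ("red", "green", "blue"))
print(brighter, added, planes)
```

The domain check in induce_binary makes the implied spatial join explicit: cells are paired only where both arrays are defined on the same coordinates.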

2.3.3 Aggregation 

Obviously, the condenser provides the appropriate basis for aggregation over 
arrays. Table 1 lists some of the most common aggregations and their definition in 
Array Algebra. 



Table 1: Some possible aggregate operators on arrays. Assumed are array expressions 
a (without restriction), b of result type Boolean, and c with a numerical result type. 

Array aggregate definition    Meaning 
--------------------------    --------------------------------------- 
Count_cells( a )              The number of cells in a 
Some_cells( b )               Is there any cell in b with value true? 
All_cells( b )                Do all cells of b have value true? 
Sum_cells( c )                The sum of all cells in c 
Avg_cells( c )                The average of all cells in c 
Max_cells( c )                The maximum of all cells in c 



2.4 Further Application Examples 

A basic requirement in the development of Array Algebra has been to cover all 
applications of arrays in databases, the most important ones being statistics, OLAP, 
and imaging. To illustrate applicability of Array Algebra to these areas, we now 
present some advanced examples. 








Peter Baumann 



2.4.1 Statistics 

Example matrix multiplication: Let a be an m×n matrix and b an n×p matrix. 
Then, the m×p matrix product 

$$(a * b)_{i,k} = \sum_{j=1}^{n} a_{i,j} * b_{j,k}$$

in Array Algebra is expressed as 

$$\mathrm{MARRAY}_{[1:m,\,1:p],\,(i,k)}(\, \mathrm{COND}_{+,\,[1:n],\,j}(\, a(i,j) * b(j,k) \,) \,)$$



Example auto correlation: For two observation vectors x and y of dimension n, 
empirical covariance $m_{x,y}$ is defined as 

$$m_{x,y} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

In Array Algebra, the mean is given by $x_{avg} = \mathrm{COND}_{+,[1:n],i}(x(i)) / n$ and 
$y_{avg} = \mathrm{COND}_{+,[1:n],i}(y(i)) / n$. Then, $m_{x,y}$ is described in a straightforward 
manner: 

$$\mathrm{COND}_{+,\,[1:n],\,i}(\, (x(i) - x_{avg}) * (y(i) - y_{avg}) \,) \,/\, (n-1)$$

Example histogram: A histogram contains, for each possible value, the number of 
cells conveying this value. For some n-D one-byte integer array with intensity values 
between 0 and 255, the histogram is computed as 

$$\mathrm{MARRAY}_{[0:255],\,n}(\, \mathrm{COND}_{+,\,\mathrm{sdom}(a),\,v}(\, \text{if } a(v) = n \text{ then } 1 \text{ else } 0 \text{ fi} \,) \,)$$



As we can see, the combination of MARRAY and COND appears in quite different 
contexts. Indeed, this type of operation forms the basis for an extremely wide range of 
analysis functions, such as statistical analyses, advanced OLAP consolidation 
operations like roll-up, slice&dice, as well as scaling, convolutions and filtering in 
image processing. It is capable of completely changing dimensionality, size, and cell 
types of arrays. 
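
To make the MARRAY/COND combination concrete, the matrix product and the histogram can be written as nested calls in Python. This is only a sketch: arrays are modelled as dictionaries keyed by coordinate tuples, domains as lists of inclusive (lo, hi) pairs, and the helpers points, marray, and cond are illustrative names, not an existing API. The histogram range is shortened to [0:7] to keep the example small.

```python
from functools import reduce
from itertools import product
import operator


def points(domain):
    return product(*(range(lo, hi + 1) for lo, hi in domain))


def marray(domain, cell_expr):
    return {x: cell_expr(x) for x in points(domain)}


def cond(op, domain, cell_expr):
    return reduce(op, (cell_expr(x) for x in points(domain)))


# Matrix product of two 2x2 matrices (m = n = p = 2).
a = {(1, 1): 1, (1, 2): 2, (2, 1): 3, (2, 2): 4}
b = {(1, 1): 5, (1, 2): 6, (2, 1): 7, (2, 2): 8}
matmul = marray([(1, 2), (1, 2)],
                lambda ik: cond(operator.add, [(1, 2)],
                                lambda j: a[(ik[0], j[0])] * b[(j[0], ik[1])]))
print(matmul)  # {(1, 1): 19, (1, 2): 22, (2, 1): 43, (2, 2): 50}

# Histogram of a tiny 2x2 image with values in [0:7].
img = {(0, 0): 3, (0, 1): 3, (1, 0): 0, (1, 1): 7}
hist = marray([(0, 7)],
              lambda n: cond(operator.add, [(0, 1), (0, 1)],
                             lambda v: 1 if img[v] == n[0] else 0))
print(hist[(3,)], hist[(0,)], hist[(5,)])  # 2 1 0
```

In both cases the outer constructor fixes the shape of the result, while the inner condenser collapses one or more dimensions of the input.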



2.4.2 OLAP 

Example roll-up: Let c be a sales datacube with D×R×P=[1:d,1:r,1:p] as spatial 
domain where dimension 1 counts days from 1 to today, dimension 2 enumerates 
sales regions, and dimension 3 contains products sold; cell values shall represent sales 
figures. The weekly average of sales per product and region, then, is expressed as 

$$\mathrm{MARRAY}_{[1:today/7] \times R \times P,\,(w,r,p)}(\, \mathrm{COND}_{+,\,[0:6],\,d}(\, c(7*w+d,\, r,\, p) \,/\, 7 \,) \,)$$

Aggregating over all products leads only to a slight change in the expression: 

$$\mathrm{MARRAY}_{[1:d/7] \times R,\,(w,r)}(\, \mathrm{COND}_{+,\,[0:6] \times P,\,(d,p)}(\, c(7*w+d,\, r,\, p) \,/\, 7 \,) \,)$$



Notably, such queries can be of considerable length when formulated relationally, 
and usually involve several joins. 
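
For comparison, the two roll-ups can be sketched in Python with the same illustrative helpers as before (dictionary-based arrays; points, marray, and cond are hypothetical names). One deliberate deviation is labelled in the code: the day index is computed as 7*(w-1)+d with d in [1:7], an assumption of this sketch so that week 1 covers days 1 to 7 of a cube whose days start at 1.

```python
from functools import reduce
from itertools import product
import operator


def points(domain):
    return product(*(range(lo, hi + 1) for lo, hi in domain))


def marray(domain, cell_expr):
    return {x: cell_expr(x) for x in points(domain)}


def cond(op, domain, cell_expr):
    return reduce(op, (cell_expr(x) for x in points(domain)))


DAYS, REGIONS, PRODUCTS = 14, 2, 2
# Toy cube: the sales figure of a cell equals its day number.
c = marray([(1, DAYS), (1, REGIONS), (1, PRODUCTS)], lambda v: v[0])

# Weekly average per region and product.
# Assumption of this sketch: day = 7*(w-1)+d with d in [1:7].
weekly = marray(
    [(1, DAYS // 7), (1, REGIONS), (1, PRODUCTS)],
    lambda v: cond(operator.add, [(1, 7)],
                   lambda d: c[(7 * (v[0] - 1) + d[0], v[1], v[2])] / 7))

# The same roll-up, additionally aggregated over all products.
weekly_all_products = marray(
    [(1, DAYS // 7), (1, REGIONS)],
    lambda v: cond(operator.add, [(1, 7), (1, PRODUCTS)],
                   lambda dp: c[(7 * (v[0] - 1) + dp[0], v[1], dp[1])] / 7))
```

Only the condenser domain and the result domain differ between the two queries, which mirrors the "slight change" noted above.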

Example top performers: On the same cube, the top performing weeks are 
determined as follows. We use the notation <s:sales, w:week> to describe a 
two-component record with component names sales and week. With function $f_c$ given as 
$f_c(i) = c[i].sales$, the following expression rolls up this cube from days to weeks 
and delivers the accumulated sales over all regions and products of the top 3 weeks: 

$$\mathrm{sort}^{desc}_{1,f}(\, \mathrm{MARRAY}_{[1:d/7],\,w}(\, \langle\, \mathrm{COND}_{+,\,[0:6] \times R \times P,\,(d,r,p)}(\, c[7*w+d,\, r,\, p] \,) : sales,\ w : week \,\rangle \,) \,)\,[1:3].week$$

The last query heavily makes use of the fact that coordinate and cell values (dimen- 
sion and measure elements in OLAP terminology) can be used interchangeably. 



2.4.3 Imaging 

Example skewed section: A skewed section through a 2-D image where the cutting 
line is not axis-parallel (Fig. 1) can be described by placing a skew factor s>1 on the 
indexing point x, resulting in the shifted point position $(s*x_1, x_1)$: 

$$\mathrm{MARRAY}_{\mathrm{sdom}_2(a),\,v}(\, a(s*v_1, v_1) \,)$$

Using the contents of another (1-D) array for indexing allows picking arbitrary cells. 
Let a have sdom(a)=[1:m,1:n] and s be a 1-D array with spatial domain X=[1:n]. 
Cell values of array s are used to index a (Fig. 1): 

$$\mathrm{MARRAY}_{\mathrm{sdom}_1(a),\,x}(\, a(s(x), x) \,)$$

Example filtering: The following expression, parametrized over array a and mask 
m (such as the edge detector illustrated in Fig. 2) can be used as a template for general 
filtering operations: 

$$f(a, m) = \mathrm{MARRAY}_{\mathrm{sdom}(a),\,x}(\, \mathrm{COND}_{+,\,\mathrm{sdom}(m),\,y}(\, a(x+y) * m(y) \,) \,)$$

Fig. 1: Skewed section (left) $\mathrm{MARRAY}_{[0:6],\,x}(\, a(x*2, x) \,)$ of 2-D array a with 
sdom(a)=[0:12,0:6] and user-defined section (right) $\mathrm{MARRAY}_{[0:6],\,x}(\, b(s(x), x) \,)$ 
of 2-D array b with sdom(b)=[0:10,0:6]; selected cells are shaded. 

Fig. 2: 2-D Sobel filter mask m = { ((-1,1),1), ((0,1),3), ((1,1),1), ((-1,0),0), 
((0,0),0), ((1,0),0), ((-1,-1),-1), ((0,-1),-3), ((1,-1),-1) }. 
The spatial domain center (0,0) is marked by a double box. 

Through instantiation with mask $s_1$ as given by Fig. 2 and mask $s_2$ as the 
transpose of $s_1$, we can express the Sobel edge detector (see Fig. 3 for an 
application example): 

$$(\, | f(a, s_1) | + | f(a, s_2) | \,) \,/\, 9$$




Fig. 3: Sobel filter applied to a 2-D raster image. 

We observe that in many cases operations can be formulated without explicitly 
referring to the array dimension, which allows developing parametrized cross-dimensional, 
domain-independent query libraries which go well beyond the capabilities of 
object-relational ADTs [19]. 
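
The filtering template and the Sobel expression of this subsection can be sketched in Python without mentioning the array dimension anywhere. The sketch makes two assumptions that the text leaves open: arrays and masks are dictionaries keyed by coordinate tuples, and cells addressed outside sdom(a) contribute 0. The function name apply_mask is hypothetical.

```python
def apply_mask(a, m):
    """f(a, m): for every cell x of a, sum a(x+y) * m(y) over mask points y.

    Assumption of this sketch: positions x+y outside sdom(a) count as 0.
    """
    def shifted(x, y):
        return tuple(xi + yi for xi, yi in zip(x, y))
    return {x: sum(a.get(shifted(x, y), 0) * w for y, w in m.items())
            for x in a}


# Mask s1 as listed in the caption of Fig. 2; s2 is its transpose.
s1 = {(-1, 1): 1, (0, 1): 3, (1, 1): 1,
      (-1, 0): 0, (0, 0): 0, (1, 0): 0,
      (-1, -1): -1, (0, -1): -3, (1, -1): -1}
s2 = {(y2, y1): w for (y1, y2), w in s1.items()}

# A 3x3 test image whose intensity grows along the second coordinate only.
a = {(i, j): j for i in range(3) for j in range(3)}
g1, g2 = apply_mask(a, s1), apply_mask(a, s2)
edges = {x: (abs(g1[x]) + abs(g2[x])) / 9 for x in a}
print(g1[(1, 1)], g2[(1, 1)])  # 10 0
```

Since coordinates are handled with zip, the same function works unchanged for 1-D signals or 3-D volumes, provided the mask has the matching dimension.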



3 From Array Algebra to RasQL 

In this Section, we sketch how Array Algebra translates to the query language 
RasQL. Array Algebra has been developed in the course of implementing this fully- 
fledged, domain-independent array DBMS based on an SQL-based query language, 
obeying strict data independence. Arrays are embedded as a data abstraction allowing, 
e.g., to define array-valued object or tuple attributes, depending on the hosting data 
model. 

Arrays can be defined either concisely with dimension and extent per dimension 
fixed, or with a fixed number of dimensions but free lower or upper bounds in some 
dimension(s), or with dimension and boundaries left completely open. Runtime range 
checking on instances, then, is performed according to the amount of information 
provided in the data dictionary. 

RasDaMan commits itself to the ODMG standard [20], hence the RasDaMan query 
language RasQL also follows the flavour of ODMG’s OQL which, in turn, leans 
on standard SQL-92. Queries range over collections which contain the class extents. 
Array expressions can appear both in the select and in the where clause of a 
query. The MARRAY equivalent in RasQL has the structure 

marray <iterator> in <spatial domain> 
values <expression> 

The COND statement is somewhat extended. In the syntactic structure 

condense <op> 
over <iterator> in <spatial domain> 
where <condition> 
using <expression> 

the where condition allows further restricting the cell set inspected. This makes 
thresholding and similar tasks more elegant to phrase and, in particular, supports 
optimization. 
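
The effect of the where clause can be illustrated with a small Python analogue of the condense statement. This is a sketch of the semantics only, not of RasQL or its implementation: the function condense, its keyword parameters, and the identity argument (used when no cell satisfies the condition) are assumptions introduced here.

```python
from functools import reduce
from itertools import product
import operator


def points(domain):
    return product(*(range(lo, hi + 1) for lo, hi in domain))


def condense(op, domain, using, where=lambda x: True, identity=None):
    """Fold using(x) with op over those points x for which where(x) holds."""
    selected = (using(x) for x in points(domain) if where(x))
    if identity is None:
        return reduce(op, selected)  # raises TypeError on an empty selection
    return reduce(op, selected, identity)


img = {(0, 0): 12, (0, 1): 200, (1, 0): 90, (1, 1): 255}
sdom_img = [(0, 1), (0, 1)]
t = 100

# Sum and count of all cells brighter than threshold t.
bright_sum = condense(operator.add, sdom_img, using=lambda x: img[x],
                      where=lambda x: img[x] > t, identity=0)
bright_count = condense(operator.add, sdom_img, using=lambda x: 1,
                        where=lambda x: img[x] > t, identity=0)
print(bright_sum, bright_count)  # 455 2
```

Cells rejected by the condition are never passed to the using expression, which is one reason why such a clause leaves room for optimization.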

Optimizing MARRAY and COND expressions is not easy (although not impossible) 
due to the generality of the operators. We therefore continuously investigate special 
cases where particularly efficient solutions exist; a rich set of over 100 rules has been 
identified and implemented so far [21]. For optimizability reasons and due to their 
practical importance, trimming, section, and induction are supported by special con- 
structs; likewise, the condenser specializations mentioned in Table 1 are supported 
directly. Table 2 demonstrates some of these constructs with the help of application 
examples. 

Table 2: Sample RasQL queries. We assume array-valued attributes for 2-D Landsat 
satellite images, 3-D volumetric images, and 3-D OLAP cubes, to be embedded in 
object classes (or relations, resp.). 

Algebra     RasQL example                    Explanation 
operator 
--------    -----------------------------    ------------------------------- 
IND_f       select img + 5                   The red channel of all Landsat 
            from LandsatImages as img        images, intensified by 5 

IND_g       select oid(br)                   OIDs of all brain images where, 
            from BrainImages as br,          in the masked area, intensity 
                 BrainAreas as mask          exceeds threshold value t 
            where br * mask > t 

TRIM        select w[ *:*, y0:y1, z0:z1 ]    OLAP slicing (“*:*” exhausts 
            from Warehouse as w              the dimension) 

SECT        select v[ x, *:*, *:* ]          A vertical cut through all 
            from VolumetricImages as v       volumetric images 

MARRAY      select marray n in [0:255]       For each 3-D image its 
                   values                    histogram 2 
                     condense + 
                     over x in sdom(v) 
                     using v[x]=n 
            from VolumetricImages as v 

COND        select condense +                For each datacube in the 
                   over x in sdom(w)         warehouse, count all cells 
                   using w[x] > t            exceeding threshold value t 
            from Warehouse as w 



A trim expression in the left-hand side of an update assignment indicates the array 
part to be updated: 

update <collection> 
set <array attribute> [<trim expression>] 
assign <array expression> 
where ... 

Besides updating part of an array, this statement can also be used to extend an array 
by appending data (e.g., during periodical warehouse population or slicewise insertion 
into a 3-D image), provided the affected dimension has been defined variable in the 
attribute definition. Formally the process of extending an array a in direction i with 
another array b matching a in all dimensions (with a possible exception in dimension 
i) and base type is governed by the algebra expression as follows. 

2 RasQL supports the interpretation of Boolean values as numerics as is usual in many 
programming languages. 

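Let $t = (0, \ldots, 0, \mathrm{sdom}_i(a), 0, \ldots, 0)$ be the translation vector consisting of zeros except 
for component $i \le d$. Then, 

$$\mathrm{extend}(a, b, i) := \mathrm{MARRAY}_{\mathrm{sdom}(a) \,\cup\, \mathrm{shift}_t(\mathrm{sdom}(b)),\,v}(\, \text{if } v \in \mathrm{sdom}(a) \text{ then } a(v) \text{ else } b(v-t) \text{ fi} \,)$$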
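
A small Python sketch may help to read the extend expression. It models arrays as dictionaries keyed by coordinate tuples and, as an explicit assumption, interprets the non-zero component of t as the number of positions that a occupies in dimension i, so that b is placed directly behind a when both start at the same lower bound. The function name extend follows the text; everything else is illustrative.

```python
def extend(a, b, i):
    """Append array b to array a along dimension i (1-based).

    Assumption of this sketch: the translation vector t is zero except for
    component i, which holds the extent of a in that dimension.
    """
    positions = [x[i - 1] for x in a]
    extent = max(positions) - min(positions) + 1
    dims = len(next(iter(a)))
    t = tuple(extent if k == i - 1 else 0 for k in range(dims))
    shifted_b = {tuple(xk + tk for xk, tk in zip(x, t)): v
                 for x, v in b.items()}
    # if v in sdom(a) then a(v) else b(v - t)
    return {v: a[v] if v in a else shifted_b[v]
            for v in set(a) | set(shifted_b)}


a = {(1, 1): "a11", (1, 2): "a12"}   # sdom(a) = [1:1, 1:2]
b = {(1, 1): "b11", (1, 2): "b12"}   # sdom(b) = [1:1, 1:2]
result = extend(a, b, 1)             # sdom(result) = [1:2, 1:2]
print(sorted(result.items()))
```

Where the shifted domain of b overlaps sdom(a), the cells of a take precedence, exactly as the if/else in the expression prescribes.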



4 Related Work 

In this Section we survey related work in formalization of arrays in databases; in 
part we rely on the classification published in [22] which gives particular attention to 
the independence of array formalisms from their implementation. 

Let us start, however, with APL [23,24] as a prominent representative from the 
programming languages area, dedicated to n-dimensional array manipulation. APL 
basically has to be compared not with the algebra but with the query language RasQL. On 
this level, both share interesting implementation problems which, however, are out of 
scope here. Conceptually, Array Algebra functionality has its respective counterparts 
in APL: the enclose operator "⊂" corresponds to the MARRAY constructor; the 
reduction operator "/" corresponds to the condenser, but is applied only to the outermost 
dimension; the each operator "¨" and, to some extent, scalar functions correspond to 
unary and binary induction. As a programming language with procedural constructs 
and recursion, APL is more powerful than Array Algebra, but not safe. 

In the database world, the algebra underlying the EXTRA/EXCESS database 
system supports 1-D, variable-length arrays [25]. As for the operators, there is a 
function SUBARR corresponding to the Array Algebra operator SECT, and 

ARR_APPLY corresponding to our unary induce $\mathrm{IND}_f$. Aggregation is not supported. 

Gray et al. [4] propose an SQL cube operator which generalizes group-by. It is 
based on a particular mapping of sparse arrays to relations. There is no clear separa- 
tion between conceptual (multidimensional) model and the proposed SQL extensions; 
specifically, no formal algebra is provided. 

In [3] a formal model for sparse array maintenance in relational systems (ROLAP) 
is presented. Array data are organized into one or more hypercubes whereby a cell 
value can either be an n-tuple (i.e., one nesting level of record elements) or a Boolean 
value denoting existence of the respective value combination. The algebra consists of 
a set of basic operations which are parametrized by user-defined functions. For 
example, operations pull and push increase/decrease, resp., a cube's dimension by 
changing coordinate values to cell contents and vice versa; in our example top per- 
formers we demonstrate how this is done in Array Algebra. Further, there is a join 
operation to combine two arrays sharing k dimensions. The “join partners” are specified 
through user-defined functions outside the formalism. Aggregation is handled the same 
way, through functions outside the formalism, as opposed to Array Algebra, where the 
condenser serves to describe aggregation. 

Cabibbo and Torlone [9] propose a more “cube-oriented” formal multidimensional 
model and a corresponding query language based on a logical calculus. The data 
model relies on the notion of n-dimensional f-tables, i.e., (mathematical) relations 
where each cell is represented by a tuple of n coordinates and the cell value itself, 
which must be atomic. Aside from the usual logical quantifiers and connectors there are 
scalar and aggregation functions which are user-defined, hence outside the formalism. 
The equal treatment of coordinates and cells as in Array Algebra is possible. 

Li and Wang [10] formalize a multidimensional model for OLAP. Its core is an 
algebraic query language called grouping algebra which treats arrays as sets of rela- 
tions plus an associated cell value which must be scalar. Operations on arrays are add 
dimension, transfer, union, aggregation, rc-join (relation/cube join), and construct 
(build array from relation). The algebra includes relations so that it can be seen as an 
extension of relational algebra. The model is very powerful, particularly in grouping, 
ordering, and aggregation. In [11] an algebra and calculus for multidimensional OLAP 
is presented. A multidimensional tabular database is defined as a set of tables. The 
model is close to the way OLAP arrays are mapped to relational star schemata. No 
direct mechanism is provided for join and aggregation; as by definition all first-order 
definable classification and aggregation functions are incorporated, these constructs 
can be expressed, too. Implementation of the model relies on an SQL mapping. 
Further important recent work in the field is described in [26, 27, 28]. 

In [12], an array query language, AQL, relying on Lambda calculus is presented 
which is geared towards scientific computing. AQL offers powerful operations on 
multidimensional arrays, with only slightly less generality in the aggregation mechanism 
than Array Algebra. The model has been implemented as a front end for querying 
arrays maintained in files using a geoscientific data exchange format. 

An Array Manipulation Language, AML, is introduced in [13]. Two operators 
serve to subsample and interleave, resp., arrays based on bit sequences governing cell 
selection. The third operator, APPLY, corresponds to induce operations modulo the bit 
pattern for cell selection. Bit patterns are modeled in Array Algebra through 1-D bit 
arrays executing the same control function. AML is more restricted than Array Alge- 
bra in that such control arrays cannot contain arbitrary values (e.g., weights), and 
moreover are constrained to 1-D. According to the authors, the main application area 
of AML is seen in imaging. 

In summary, most array frameworks nowadays are geared towards OLAP tasks, 
without regarding, e.g., spatio-temporal array application fields. Conversely, frameworks 
such as AML aiming at imaging do not consider OLAP. Sometimes array iteration 
or aggregation falls back on user-defined functions, which makes optimization difficult 
in an implementation. All operations such as aggregation and spatial join found 
in these approaches are expressible in Array Algebra, too, except that dimension hier- 
archies usually are supported by convenient mechanisms, a feature still to be included 
in Array Algebra. 
5 Conclusion 

Let us be philosophic for a moment. Loosely speaking, for every data abstraction a 
corresponding basic operation exists. For instance, component access by name corre- 
sponds with records (in C/C++: “structs”), whereas linear traversal relates to lists and 
sets. For arrays, usually (nested) loops are seen as the “natural” operation, exploiting 
the per dimension linear ordering of cell indices. Ordering, however, is not the es- 
sence. Instead, the neighborhood defined by the indices is the crucial property: con- 
solidation of a data warehouse cube, say, to derive weekly figures from daily data, 
involves seven neighbored array cells for every derived cell value. Likewise, edge 
detection in a 2-D raster image involves an n×n neighborhood of each pixel for computation 
of the result pixels. Hence, we claim that the operation characteristic of arrays 
is not iteration but (conceptually) simultaneous computation of all result array 
cells, in general based on the evaluation of some neighborhood for each cell. 
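The data warehouse example can be made concrete with a small Python sketch. It assumes the daily figures are a flat list whose length is a multiple of seven; the function name `consolidate_weekly` is hypothetical. Each result cell is defined purely in terms of its own seven-cell neighborhood, so the cells could be computed in any order or in parallel, even though Python happens to evaluate the comprehension sequentially.

```python
def consolidate_weekly(daily):
    """Derive weekly totals; each result cell depends only on its own seven-day neighborhood."""
    weeks = len(daily) // 7
    # Cell w is defined independently of every other cell, so no evaluation order is implied.
    return [sum(daily[7 * w: 7 * w + 7]) for w in range(weeks)]

daily = [1] * 7 + [2] * 7 + [3] * 7      # three weeks of made-up daily figures
weekly = consolidate_weekly(daily)       # [7, 14, 21]
```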

RasDaMan Array Algebra has been designed as an algebraic framework for multi- 
dimensional arrays of arbitrary dimension and base type. Essentially two functionals 
and a sorter are sufficient for a broad range of statistics, OLAP, and imaging opera- 
tions. They are declarative by nature and do not prescribe any iteration sequence, 
thereby opening up a wide field for query optimization and parallelization. Array 
Algebra is minimal in the sense that no subset of its operations exhibits the same ex- 
pressive power. It is also closed in application: any expression is either of a scalar or 
an array type. Finally, Array Algebra does not rely on any external array handling 
functionality (“user-defined functions”) aside from the operations coming with the algebraic 
structure of the cell type. By making all operations explicit, query optimization 
is eased considerably. 

Array Algebra forms the formal basis for the domain-independent array DBMS 
RasDaMan [17] marketed by Active Knowledge GmbH 3 . The query language RasQL 
supports declarative array expressions embedded in standard SQL-92. The array query 
optimizer relies on about 150 algebraic rewrite rules on logical and physical level [21]. 
Streamlined storage management allows arrays to be distributed across heterogeneous 
storage media. RasDaMan is being used, e.g., in the European Human Brain Database 4 
for WWW-based access to 3-D human brain images. 

Future work on the conceptual level will encompass domain-specific features such 
as dimension hierarchies for OLAP, including symbolic dimension handling instead of 
the pure numbering scheme used now, and vector/raster integration for geo applica- 
tions. On the architectural level, extending RasQL functionality based on further 
benchmarking [17] will keep us busy for some time. 



3 See www.active-knowledge.de/ 

4 see www.dhbr.neuro.ki.se/ECHBD/Database/ 
Abstract. Object-oriented databases have rich semantics, which enable 
the definition of various relationships among objects. Sharing levels 
and composition types make it necessary to define whether, and to 
what extent, a composed object should propagate a message it receives 
to its composing objects (propagation rules). Current solutions 
refer to a system with stable connections, so propagation values can be 
set at the design stage. 

Turning a compound object into a distributed collection of simpler 
ones, some of which are shared, necessitates defining exact protocols 
for transaction processing and concurrency control. The information 
system described contains complex relations, which vary on a daily basis. 
The paper examines the various update operations from the point of view 
of relation creation, and suggests an approach for defining these 
new relations and updating the propagation values dynamically, thus 
reflecting the changing nature of relations. 

Keywords: Composite object, Message object, Propagation rules, Dynamic relation creation. 

1 Introduction 

1.1 Object-Oriented Concepts 

Object-orientation is an important modeling concept [2]. To be considered ‘object-oriented’, 
a system has to fulfill several criteria. A basic paradigm is the notion of objects 
grouped into classes, which are themselves organized in sub-class hierarchies [18]. 
Such a system offers three basic relation types: Instantiation (Instance-of), 
Specialization (IS-A), and Aggregation (Part-of), as described in [9]. 

In such a model, when an information unit (record, tuple, sub-object) can belong to 
more than one object, or when it has an independent behavior by its nature, the need 
arises to represent it as an object by itself, and make it a part of a complex object by a 
composition relation [1,15,6]. Kim [6] distinguishes between weak and composite references. 
A composite reference (composite in this paper) implies that the composing 
object is essentially a part of the main one it composes; it may belong to a few other 
objects, but probably should not exist by itself. A weak reference (reference in this paper) 
means that the composing object exists on its own; the composed object does not 
own it, but merely accesses some information it contains. 
The sharing possibilities and composition types necessitate the definition of two 
aspects: a) whether, and to what extent, a composed object should propagate a message 
that it receives to its composing objects (propagation rules); b) how to efficiently 
perform transactions. 

One of the most relevant works on propagation is by Rumbaugh [15]. Rumbaugh 
proposed that: a) the decision on the propagation step should be determined by 
the type of operation (copy, delete, save, print, etc.), the type of relation (composition, 
reference), and the class types; b) the possible propagation types are Propagate, Shallow, 
None, and Inhibit. For each combination of an operation type and a relation between 
two objects there is a propagation value. Rumbaugh’s work refers to a system 
with stable connections, so propagation values can be decided at the design 
stage. His ideas will be formalized in section 2. 

When dealing with complex information systems, which evolve over time, the 
static approach of Rumbaugh may not be adequate. The need arises to determine 
propagation characteristics at the time a new relation is created. This property compli- 
cates the solutions to the previously mentioned problems. It requires adding new con- 
cepts and extending the algorithm. 

The main contributions of this paper are the handling of propagation in dynamically 
evolving relations, the handling of concurrency, and the maintenance of a relatively simple 
propagation algorithm by absorbing as many of the changes as possible in a message object. 

1.2 Informal Description of the DA System 

The information system that motivated the proposals of this paper is that of the District 
Attorney office of the Negev region, Israel. This is a highly complex information 
system, which deals with several types of files as the main entity [Fig. 1]. Its com- 
plexity emerges not just because of the many attributes and mutual constraints among 
them, but mainly because of the complex and dynamic relations among objects. In 
order to simplify the examples only some of the main objects will be described, and a 
very small part of their attributes. 




Fig. 1. The main DA system files. 

A simplified explanation: After a crime is committed, a police file (POLF) is prepared, 
which contains information about the event and the exhibits (EXH). If there is 
a suspect (SUS), with some evidence of his involvement in the crime, the file (possibly 
with other files concerning the same suspect) is handed over to the DA office. A 
prosecution file (PROSF) is created. At that stage it is composed of a list of police 
files and a list of suspects, and is categorized by the list of offenses (OFF). The next stage is 
assigning a responsible prosecutor (PROAT) to it, who decides to prepare one or more 
accusation files against some or all of the suspects. Afterwards, much of the computer work is 
with the court files (COF), mainly updating and querying appearances (APP). The 
detailed relations are depicted in figure 2. 

1.3 Organization and Motivation 

In order to motivate our discussion of propagation rules, let us examine the operation 
Icopy. This operation is defined as duplicating a PROSecutionFile object from the 
main database to a private file, which belongs to a certain prosecutor who prepares 
her strategy on the case. Initiating Icopy does not imply copying everything that composes 
or is referenced by the original PROSecutionFile object. Rather, it distinguishes 
between three kinds of components: (a) components that should not be copied because 
they were not an integral part of the original object (such as PROsecutingATtorney and 
OFFense, which are objects in their own right) or are part of the global database (like 
POLiceFile); (b) composing components that are currently at work and should be copied 
so that the prosecutor will perform her draft work on her private copy of that data; 
(c) components that describe technical data (like the list of secretaries who handled the 
file), which we expect Icopy to skip. The above behavior motivates the existence of multiple 
propagation types. 

While doing so, the components may be involved in other operations, and lock 
conflicts may arise, even for small objects, because of the distributed nature of the 
model. We seek a solution that will guarantee correctness, will be efficient, and treat 
concurrency control mechanisms (lock/unlock) in a way that will resemble the way 
information processing operations are treated. 

The rest of the paper is structured as follows. Section 2 contains basic definitions 
of composition types, propagation possibilities (attributes), and propagation criteria. It 
gives an example composition diagram with its propagation values. Section 3 intro- 
duces the basic propagation algorithm (based on [15]). It then demonstrates its appli- 
cation on the simplified PROSecutionFile object and suggests slight modifications. 
Section 4 describes the dynamic nature of relationships in the DA system through the 
Fjoin operation, regarding an operation as (eventually) a new relation with its attrib- 
utes. It then suggests treatment of an operation not just as an action and a relation, but 
also as an object having attributes which influence its behavior. As a result, a much 
clearer algorithm is presented. Section 5 introduces the concurrency control problems 
in the model, the lock levels and treatment, modifications to system-wide structures 
and an algorithm. Section 6 summarizes this approach. 

1.4 Other Directions 

Work on dynamic creation of relationships among objects was done in other direc- 
tions. An important one is Version Control. During schema evolution it might be re- 
quired to maintain several versions of classes and / or objects, for various reasons 
(smooth incorporation of a new version, ability to cancel changes). The interested 
readers are referred to [5, ch. 5] for an introduction to the subject. Specific research 
works are described in [8] on version management for sets of modified objects, [4] on 
consistency maintenance, and [13] about sub-versions, to mention only a few. 
Another important direction is that of Adaptive Programs (AP): An information 
system is described by class dictionary graphs and APs. An AP is described by an 
operation, propagation patterns, and wrapping code [10], [11]. A class dictionary graph 
describes classes and relations between them. The operation is the ‘main’ instructions 
as to what type of calculations should be carried out. A propagation pattern is all the 
possible traversals between the start and end nodes, provided they obey the ‘through 
nodes’. The wrapping code is the specific instructions that should be added to solve 
the particular task [12]. So, many programs can be produced from the same class dic- 
tionary graph, using a different or the same propagation pattern. The AP approach is 
directed mainly to retrieval tasks. 
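The notion of a propagation pattern as the set of traversals between a start and an end node can be sketched in Python as follows. This is only an illustration under stated assumptions: the class dictionary graph is a plain adjacency dictionary, traversals are taken to be simple paths, and the function `traversals` and the class names A to D are hypothetical rather than part of the AP approach in [10], [11], [12].

```python
def traversals(graph, start, end, through=()):
    """Enumerate simple paths from start to end that visit every node in `through`."""
    found = []

    def walk(node, path):
        if node == end:
            if all(n in path for n in through):
                found.append(list(path))
            return
        for nxt in graph.get(node, ()):
            if nxt not in path:              # keep paths simple (no class visited twice)
                path.append(nxt)
                walk(nxt, path)
                path.pop()

    walk(start, [start])
    return found

# Hypothetical class dictionary graph: A relates to B and C, both of which relate to D.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
all_paths = traversals(graph, "A", "D")                  # both routes from A to D
via_c = traversals(graph, "A", "D", through=("C",))      # only the route through C
```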



2 Basic Concepts and Data System 

In this section we formalize Rumbaugh’s [15] ideas: Basic relation types and formal 
propagation model (criteria and propagation types). Then the example classes / ob- 
jects in our DA system are listed, together with some example operations. When re- 
ferring to objects, we use M to denote a composed object and P as a composing one. 

2.1 Basic Concepts of Propagation of Operations 

As long as a compound object is considered to be a single instance of a class with ag- 
gregation relations to simpler classes, the usual data management operations are per- 
formed on the object as a whole. If an operation is to be performed on a particular ag- 
gregating component, the message is sent to the compound object (because its com- 
ponents cannot be addressed directly). The appropriate method of the compound ob- 
ject would be invoked. Then we might need variations on that method for other ob- 
jects with such a component. Now, considering the parts to be independent objects, it 
is possible to send one standard message directly to the specific one. But usually we 
would like it to be performed on 'most' of the compound object. That can be achieved 
by sending it the message, and expecting it to propagate it appropriately. To whom, 
and to what extent? According to [15] this is determined by the type of operation, 
the type of relation, and the class, as follows: 

Type of Operation: Whether we want to copy, save, print, delete and so on, the 
main object. A print tends to be an operation which should be propagated, whereas 
for other operations propagation seems to be limited to some components only. 

Type of Relation: Aggregation, Composition, Reference, as described in section 1.1. 
Generally Aggregation causes propagation, Composition may or may not, and 
Reference leads to no propagation. 

Classes involved: Which are the Main and Component classes and what is the se- 
mantics of the association between them? This makes it possible to refine the propa- 
gation value setting. 

Rumbaugh[15] defines 5 propagation types: (full) propagation, shallow, none, in- 
hibit, and error. This is an extension to previous works, which distinguished between 
deep propagation and none [2], [3]. 
Definition 1: Propagation types. 

Propagate: The operation is performed on the composing element and relation. 

Shallow: The operation is propagated only to the relation element. 

None: The operation is not propagated at all. 

Inhibit: The operation is not applied even to the receiving element, and no 
propagation occurs. 

Error: The whole operation should be cancelled and all the composing objects 
(of the compound one) should retain their previous state. 



In this paper we ignore the ‘error’ situation, and limit the ‘inhibit’ only to situa- 
tions that produce an inhibit state, rather than as a propagation value. On the other 
hand, we introduce the ‘must’ and ‘irrelevant’ propagation values. 

must: It is essential that the operation be performed on the component, as a part 
of performing the operation on the ‘self’ part of the composed one. 

irrelevant: The operation can’t fully propagate to the current object, so there is no 
meaning to any propagation attributes further on. 
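The propagation values used in this paper can be written down as a small Python enumeration. This is a sketch for illustration only: the class name `Propagation`, the string codes chosen for must and irrelevant, and the helper `reaches_component` are assumptions, not part of the model.

```python
from enum import Enum

class Propagation(Enum):
    PROPAGATE = "p"        # operation is performed on the composing element and the relation
    SHALLOW = "s"          # operation is propagated to the relation element only
    NONE = "n"             # operation is not propagated at all
    MUST = "must"          # operation has to be performed on the component (code is a placeholder)
    IRRELEVANT = "-"       # no propagation attribute has meaning further on (code is a placeholder)

def reaches_component(value):
    """True when the operation itself is carried out on the composing object."""
    return value in (Propagation.PROPAGATE, Propagation.MUST)

checks = {v.name: reaches_component(v) for v in Propagation}
```

Only PROPAGATE and MUST map to `True` in `checks`, which reflects that shallow propagation touches the relation but not the component itself.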



2.2 A Simplified DA Data System 

Following is a set of data items of the DA system. It is simplified with respect to the 
types of objects mentioned and to the connections and limitations which exist practi- 
cally. Also, only very few operations will be mentioned. We start with a table of 
propagation values for each of the combinations of relations between two objects, for 
each operation, followed by a PROSF figure. Then we briefly explain each operation. 



Table 1. Propagation values for some combinations of relations and operations for the DA 
system. ‘p’ indicates full propagation, ‘s’ shallow, ‘n’ none, must, and ‘—’ irrelevant. 
They are stored within each object class, for each possible relation and function. 
A typical PROSecutionFile can look like: 
Fig. 2. Structure of a simplified PROSecutionFile 

Where the example functions are: 

Ecopy : Copies a PROSecutionFile object to a remote database, so whatever 
isn’t global should be copied (except for ‘unimportant’ data). 

Icopy : Copies a PROSecutionFile object to a private database in the same 
domain, so some information can be referenced instead of copied. 

Print : Prints principal working data for the object. 

Save : Saves the object and its components (data and / or references). 

Del : Deletes the object and its components (unless an object participates in 
composing another object). 

Fjoin : Regards two DA files as one legal entity from now on. The operation 
is quite complex and will be described in detail in section 4. 



3 Basic Propagation Algorithms 



3.1 Building Blocks of Propagation 

Formalizing Rumbaugh’s ideas yields the Propagate1 algorithm. In order to present it, 
we define the following functions: 

Next_Ref : Returns the ID of the next component of the compound object. 

Get_Comp_Obj : Gets an object ID, returns a pointer to it. 

Apply : The actual application of a specific operation on the self part. 

Update_Ref : If a new object was created (by copy, version, etc.), updates the 
reference between the current one and the new one. 

Prop : Gets an operation type and a class, returns the propagation 
value from the receiving object’s class to the class given as a 
parameter, for that operation. 

Example: PROSF→Prop(POLF→Type(), Icopy) = ‘p’ 
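A minimal Python sketch of such a lookup is given below. It assumes that each class keeps its own table keyed by component class and operation; the names `PROP_TABLES` and `prop`, the fallback to shallow for missing entries, and the class names M and P are hypothetical. The only entry taken from the text is the PROSF, POLF, Icopy example.

```python
# Per-class propagation tables: owner class -> {(component class, operation): value}.
PROP_TABLES = {
    "PROSF": {("POLF", "Icopy"): "p"},   # the single entry given in the example
    "M": {("P", "Save"): "n"},           # hypothetical classes and value, for illustration
}

def prop(owner_class, component_class, operation, default="s"):
    """Return the propagation value from owner_class to component_class for an operation.

    Falling back to shallow ('s') when no entry exists is an assumption of this sketch.
    """
    return PROP_TABLES.get(owner_class, {}).get((component_class, operation), default)

known = prop("PROSF", "POLF", "Icopy")   # 'p', as in the example
missing = prop("M", "P", "Print")        # no entry, so the assumed default 's'
```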

Propagate1 is passed the operation as a parameter. It assumes pre-order processing 
and does not handle the Inhibit situation. Its default propagation is shallow. 

In Propagate1, when an object receives the message, the operation is applied to its 
integral parts, its first composing or referenced object is fetched, and we enter the 
main loop, which identifies the propagation type: propagate (designated as 5p), shallow 
(5s) or none (5n). If propagate is required, the composing object is sent the 
Propagate1 message, and each relation with each of the components of that object is 
examined for its propagation characteristic. Then an Update_Ref message is sent, to 
handle (if necessary) changes of references / pointers. If it was shallow, only Update_Ref 
(at most) is needed, and if none, no action is taken. 

Propagate1 (Operation OP) 
VAR Pi    : Object ; 
    E_OBJ : Object ;            /* E_OBJ = Effective object */ 
    Ri    : Object_Reference ; 
    ptype : Propagation_Type ; 
begin 
{1}   E_OBJ := Self.Apply (OP) ; 
{2}   Ri := Self.Next_Ref () ; 
      while (Ri <> NULL) do 
{3}     Pi := Self.Get_Comp_Obj (Ri) ; 
{4}     ptype := Self.Prop (Pi.Type (), OP.Type ()) ; 
{5}     case ptype of 
{5p}      'p' : Pi.Propagate1 (OP) ; 
                E_OBJ.Update_Ref (Ri) ; 
{5s}      's' : E_OBJ.Update_Ref (Ri) ; 
{5n}      'n' : /* no action taken */ 
          otherwise : E_OBJ.Update_Ref (Ri) ; 
        end case ; 
{6}     Ri := Self.Next_Ref () ; 
      end while ; 
{7}   return ; 
end ; 

Propagate1: Basic propagation procedure. 
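To make the control flow of the basic propagation procedure above easier to follow, here is a hedged Python sketch. It is a simplification, not the paper's implementation: the class `Obj`, the `log` list that records actions instead of really applying an operation, and the class names M, P1, P2, P3, and Q are all hypothetical, and reference updates are logged against the receiving object rather than against a separate effective object.

```python
class Obj:
    def __init__(self, cls, refs=(), table=None):
        self.cls = cls
        self.refs = list(refs)        # composing / referenced objects
        self.table = table or {}      # (component class, operation) -> 'p' | 's' | 'n'

    def prop(self, component_cls, op):
        return self.table.get((component_cls, op), "s")       # default propagation: shallow

    def propagate1(self, op, log):
        log.append(("apply", self.cls))                       # step {1}
        for comp in self.refs:                                # steps {2}, {3}, {6}
            ptype = self.prop(comp.cls, op)                   # step {4}
            if ptype == "p":                                  # step {5p}: recurse, then fix reference
                comp.propagate1(op, log)
                log.append(("update_ref", self.cls, comp.cls))
            elif ptype == "n":                                # step {5n}: no action
                pass
            else:                                             # step {5s} and the otherwise branch
                log.append(("update_ref", self.cls, comp.cls))
        return log

q = Obj("Q")
p1 = Obj("P1", refs=[q])
m = Obj("M", refs=[p1, Obj("P2"), Obj("P3")],
        table={("P1", "copy"): "p", ("P2", "copy"): "n"})
log = m.propagate1("copy", [])
```

In this run the operation is applied to M and, through full propagation, to P1; Q and P3 only receive a reference update because they fall under the shallow default, and P2 is skipped entirely.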



3.2 Application of Propagate1 to the DA System 

Figure 3 (to be read in parallel with figure 4) shows the application of Icopy, according 
to the propagation values of table 1. The propagation type is marked with the appropriate 
choice (5p means that a full propagate was chosen, 5s shallow, 5n none), 
with an extension of 5p_b and 5p_e for the beginning/end of a deeper level of composition, 
and underscores getting longer for each step down the composition tree. 




